1. Project Introduction and Setup

Kia ora! This report documents my analysis of the New Zealand housing market. My goal was to integrate various datasets related to property prices, rentals, and dwelling tenure to uncover key trends and build predictive models, primarily focusing on the Rent Consumer Price Index (Rent CPI). This work involved extensive data wrangling, exploratory data analysis, feature engineering, machine learning, and initial steps into time series forecasting. I aimed for a robust pipeline that could eventually support predictions out to 2030.

1.1. Objectives

My main objectives for this project were: * To source, clean, and integrate disparate housing datasets into a cohesive analytical base. * To perform a comprehensive Exploratory Data Analysis (EDA) to understand historical trends, seasonality, and regional variations in key housing metrics. * To engineer relevant features that could capture underlying market dynamics and improve predictive model performance. * To build, tune, and evaluate several machine learning models to predict Rent CPI. * To explore and implement initial time series forecasting techniques for future Rent CPI prediction, comparing different methodologies.

1.2. Data Sources

The analysis is based on three main CSV files I obtained: * rental.csv: This file contains rental CPI data, which includes breakdowns by broader New Zealand regions. * property.csv: This dataset provides median property prices for a more granular set of locations across the country. * dwellings.csv: This file offers national-level statistics on dwelling tenure, such as the number of owner-occupied, rented, and freely provided dwellings.

1.3. Setting up the R Environment

First things first, I need to load all the R libraries that I’ll be using throughout this analysis. I’ve grouped them by their primary function (data manipulation, time series, machine learning, plotting). I’ve made a note to load plyr before dplyr if I were to use specific plyr functions, as dplyr (loaded via tidyverse) has some functions with the same names. For this project, dplyr is my main tool for data wrangling. tidymodels is a fantastic collection of packages for a modern approach to modelling, and caret is also very useful for its train function and model evaluation tools.

# Data Manipulation & Core
library(plyr)      # load it BEFORE dplyr to avoid conflicts.
library(tidyverse) # Includes dplyr, ggplot2, readr, etc.
library(lubridate) # For date-time manipulation.
library(janitor)   # For cleaning column names.
library(zoo)       # For as.yearmon, rollapply, rollmean.

# Time Series Specific
library(tseries)   # For time series tests like adf.test
library(forecast)  # For ARIMA, auto.arima, forecast, checkresiduals, ndiffs

# Machine Learning & Evaluation
library(tidymodels)  # Meta-package for modern modelling: recipes, rsample, parsnip, yardstick, etc.
# library(recipes)     # For ML preprocessing (loaded by tidymodels)
# library(rsample)     # For initial_split (loaded by tidymodels)
library(caret)       # For ML model training (can also use tidymodels workflows)
library(glmnet)      # For Lasso/Elastic Net regression.
library(randomForest)# For Random Forest algorithm.
library(xgboost)     # For XGBoost algorithm.
# library(Metrics)   # For RMSE - prefer yardstick::rmse_vec from tidymodels, loaded via 'tidymodels'

# Plotting & Visualization
library(corrplot)    # For correlation matrix plots.
library(ggthemes)    # For additional ggplot themes.
library(scales)      # For formatting axes (dollar, percent, comma).
library(RColorBrewer)# For colour palettes.
library(ggpubr)      # For easily adding stats like correlation to ggplot (e.g., stat_cor).
library(ggrepel)     # For geom_text_repel to avoid overlapping text labels in plots.

With the environment set up, I can now proceed to load and process the data.

2. Data Loading and Initial Standardisation

The first step in any analysis is getting the data in and ensuring it’s in a clean, usable format. The column names in the raw files needed a bit of tidying (using janitor::clean_names()), and the time_ref column, which represents the time period, needed to be converted into a consistent R Date object for proper time series analysis and merging.

# Reading in my datasets.
# I'm using read_csv from readr (part of tidyverse) for fast and friendly CSV reading,
# and piping straight into clean_names() from janitor to sort out any messy column names.
rental_raw <- read_csv("rental.csv") %>% clean_names()
property_raw <- read_csv("property.csv") %>% clean_names()
dwellings_raw <- read_csv("dwellings.csv") %>% clean_names()

# It's always a good idea to have a quick look at the first few rows and structure
# of the raw data to understand what I'm dealing with.
# For this report, I'll just show a sample of one.
print("Sample of raw rental data:")
## [1] "Sample of raw rental data:"
head(rental_raw)

My initial check of the data showed that the time_ref column was key for linking datasets but wasn’t in a standard date format.

2.1. Standardising Time References

The time_ref column in the datasets was numeric (e.g., 2007.01 for January 2007). Occasionally, it appeared as 2007.1, which I interpreted as October 2007 — effectively 2007.10. To ensure consistent parsing, I wrote a function to convert these values into proper R Date objects set to the first day of the respective month. This step is essential for reliable time-based operations and accurate merging of datasets.

# This function converts YYYY.MM-style 'time_ref' values into actual R Date objects.
# It uses sprintf to force two decimal places, so '2007.1' becomes '2007.10'.
convert_time_ref_to_date <- function(df, time_ref_col_name = "time_ref") {
  df <- df %>%
    mutate(
      # Standardise to ensure two decimal places, e.g., 2007.1 becomes 2007.10
      time_ref_numeric = as.numeric(sprintf("%.2f", .data[[time_ref_col_name]])),
      # Extract the year
      year = floor(time_ref_numeric),
      # Extract the month by multiplying the decimal part by 100
      month_val = round((time_ref_numeric %% 1) * 100),
      # Create a proper Date object
      date = make_date(year, month_val, 1)
    ) %>%
    select(-time_ref_numeric) # Drop intermediate column
  return(df)
}

# Apply the function to all raw datasets
rental <- convert_time_ref_to_date(rental_raw, "time_ref")
property <- convert_time_ref_to_date(property_raw, "time_ref")
dwellings <- convert_time_ref_to_date(dwellings_raw, "time_ref")

# Quick validation
print("Sample of Rental Data with new 'date' Column:")
## [1] "Sample of Rental Data with new 'date' Column:"
head(select(rental, time_ref, date, year, month_val, rent_cpi, region))

This standardisation step is vital. With consistent date columns across datasets, I can now reliably join and compare observations over time. This paves the way for the next step: integrating the different data sources.

3. Data Integration: Mapping and Merging

Now that my individual datasets were loaded and had standardised date columns, the next big step was to integrate them. This was probably one of the trickiest parts because the property.csv data had very specific locations, while rental.csv used broader regional categories (like “Rest of North Island”). To combine these meaningfully for a regional analysis, I had to:

  1. Identify which specific locations in property.csv would roll up into the broader rental regions.
  2. For categories like “Rest of North Island” and “Rest of South Island” in the rental data, I decided to calculate an average median property price from their constituent specific locations. I know this is an approximation – a weighted average by sales volume would have been ideal if I had that data – but for this analysis, the mean provides a reasonable measure.
  3. I also needed to map the “NZ total” from the property data to the “National” region in the rental data for an overall comparison.
  4. Finally, the dwellings.csv data, which is national and quarterly, needed to be joined to my monthly regional data. I decided to do this by mapping each month to its respective quarter and then filling the national dwelling figures forward across all months within that quarter.

Here’s how I implemented that:

# First, I defined which specific property locations would make up my "Rest of..." rental regions.
# These are based on standard New Zealand regional classifications.
north_island_rest_locations <- c("Northland", "Waikato", "Bay of Plenty", "Gisborne", "Hawke's Bay", "Taranaki", "Manawatu")
south_island_rest_locations <- c("Tasman", "Nelson", "Marlborough", "West Coast", "Otago", "Southland")

# 1. Aggregating property data for "Rest of North Island"
# I'm grouping by date and other original time columns to ensure the mean is calculated correctly for each period.
property_roni <- property %>%
  filter(location %in% north_island_rest_locations) %>%
  group_by(date, year, month_val, time_ref) %>% 
  summarise(median_price = mean(median_price, na.rm = TRUE), .groups = "drop") %>% # Using mean of median_prices
  mutate(region = "Rest of North Island")

# 2. Aggregating property data for "Rest of South Island"
property_rosi <- property %>%
  filter(location %in% south_island_rest_locations) %>%
  group_by(date, year, month_val, time_ref) %>%
  summarise(median_price = mean(median_price, na.rm = TRUE), .groups = "drop") %>%
  mutate(region = "Rest of South Island")

# 3. Preparing property data for regions that have a direct name match, and for the "National" aggregate.
property_direct_match <- property %>%
  filter(location %in% c("Auckland", "Wellington", "Canterbury", "NZ total")) %>%
  mutate(region = case_when(
    location == "NZ total" ~ "National", # Mapping "NZ total" from property data to "National"
    TRUE ~ location # Auckland, Wellington, Canterbury names are kept as is.
  )) %>%
  select(date, year, month_val, time_ref, region, median_price) # Selecting only the columns I need.

# Now, I'm combining these processed property datasets into one table called 'property_mapped'.
property_mapped <- bind_rows(
  property_direct_match,
  property_roni,
  property_rosi
) %>%
  select(date, region, median_price, year, month_val, time_ref) %>% # Standardising columns before the join.
  distinct(date, region, .keep_all = TRUE) # Just a safeguard to ensure no duplicate rows per date/region.

# Quick look at what property_mapped contains for a specific region.
print("Sample of mapped property data for Auckland:")
## [1] "Sample of mapped property data for Auckland:"
print(filter(property_mapped, region == "Auckland") %>% head())
## # A tibble: 6 × 6
##   date       region   median_price  year month_val time_ref
##   <date>     <chr>           <dbl> <dbl>     <dbl>    <dbl>
## 1 1992-01-01 Auckland       135000  1992         1    1992.
## 2 1992-02-01 Auckland       142000  1992         2    1992.
## 3 1992-03-01 Auckland       135000  1992         3    1992.
## 4 1992-04-01 Auckland       135000  1992         4    1992.
## 5 1992-05-01 Auckland       135000  1992         5    1992.
## 6 1992-06-01 Auckland       137000  1992         6    1992.
print("Sample of mapped property data for Rest of North Island:")
## [1] "Sample of mapped property data for Rest of North Island:"
print(filter(property_mapped, region == "Rest of North Island") %>% head())
## # A tibble: 6 × 6
##   date       region               median_price  year month_val time_ref
##   <date>     <chr>                       <dbl> <dbl>     <dbl>    <dbl>
## 1 1992-01-01 Rest of North Island       90000   1992         1    1992.
## 2 1992-02-01 Rest of North Island       91286.  1992         2    1992.
## 3 1992-03-01 Rest of North Island       90571.  1992         3    1992.
## 4 1992-04-01 Rest of North Island       91000   1992         4    1992.
## 5 1992-05-01 Rest of North Island       91143.  1992         5    1992.
## 6 1992-06-01 Rest of North Island       88786.  1992         6    1992.

With the property data now mapped to the same regional definitions as my rental data, I could proceed with merging them. Then, I integrated the national dwellings data.

# Joining my rental data with the 'property_mapped' data.
# Using a left_join to keep all rental records and bring in matching property prices.
merged_rp <- rental %>%
  left_join(property_mapped, by = c("date", "region"), suffix = c("_rent", "_prop")) %>%
  # The join might create suffixed columns if original names (like year, month_val) existed in both.
  # I am cleaning these up, prioritising the ones from the 'rental' dataframe (which I aliased with _rent).
  select(-any_of(c("year_prop", "month_val_prop", "time_ref_prop", "time_ref_rent.y", "year.y", "month_val.y"))) %>% 
  rename(year = year_rent, month_val = month_val_rent, time_ref = time_ref_rent) 

# Next, I prepared the quarterly dwellings data for joining with my monthly 'merged_rp' data.
merged_rp <- merged_rp %>%
  mutate(
    quarter_val = quarter(date), # Figuring out the quarter for each month.
    # Creating a join key: the first month of whatever quarter each date falls into.
    dwelling_join_month = case_when(
      quarter_val == 1 ~ 1, quarter_val == 2 ~ 4,
      quarter_val == 3 ~ 7, quarter_val == 4 ~ 10
    ),
    dwelling_date_join = make_date(year, dwelling_join_month, 1)
  )

# Selecting and renaming columns from dwellings data for a clean join.
dwellings_for_join <- dwellings %>%
  select(date_dwelling_source = date, owner_occupied, provided_free, rented, total)

# Performing the join to get the final 'full_data' table.
# Dwelling data is quarterly, so I'm using fill() to propagate these national figures
# to all months within that quarter for each region.
full_data <- merged_rp %>%
  left_join(dwellings_for_join, by = c("dwelling_date_join" = "date_dwelling_source")) %>%
  # Grouping by year and quarter ensures fill operates correctly within each period.
  group_by(year, quarter_val) %>% 
  fill(owner_occupied, provided_free, rented, total, .direction = "downup") %>% 
  ungroup() %>% # Ungrouping is important after grouped operations.
  filter(!is.na(total)) # Ensuring only rows with successfully joined dwelling data are kept.

# Let's check the structure and a sample of the final merged data.
print("Structure of full_data:")
## [1] "Structure of full_data:"
str(full_data, list.len=5)
## tibble [984 × 14] (S3: tbl_df/tbl/data.frame)
##  $ time_ref           : num [1:984] 2006 2006 2007 2007 2007 ...
##  $ rent_cpi           : num [1:984] 1000 995 1007 1012 1020 ...
##  $ region             : chr [1:984] "Auckland" "Auckland" "Auckland" "Auckland" ...
##  $ year               : num [1:984] 2006 2006 2007 2007 2007 ...
##  $ month_val          : num [1:984] 11 12 1 2 3 4 5 6 7 8 ...
##   [list output truncated]
print("Sample of final full_data:")
## [1] "Sample of final full_data:"
print(head(select(full_data, date, region, rent_cpi, median_price, owner_occupied, total)))
## # A tibble: 6 × 6
##   date       region   rent_cpi median_price owner_occupied   total
##   <date>     <chr>       <dbl>        <dbl>          <dbl>   <dbl>
## 1 2006-11-01 Auckland     1000       425000        1098600 1649200
## 2 2006-12-01 Auckland      995       425000        1098600 1649200
## 3 2007-01-01 Auckland     1007       419500        1101500 1655400
## 4 2007-02-01 Auckland     1012       435000        1101500 1655400
## 5 2007-03-01 Auckland     1020       441000        1101500 1655400
## 6 2007-04-01 Auckland     1023       452000        1104000 1661100

This merging process was a bit fiddly, especially getting the regional mapping and the quarterly-to-monthly join right, but I think full_data now correctly combines all the information I need for the next steps of feature engineering and analysis.

4. Feature Engineering

With the full_data table now properly integrated, my next step was to engineer some new features. My thinking here was that creating more specific indicators from the existing data could help uncover deeper insights during EDA and provide more meaningful inputs for the predictive models. The features I decided to create include:

  • National Dwelling Ratios: Proportions of rented, owner-occupied, and freely provided dwellings at a national level. Even though these are national, their change over time might correlate with regional rental market pressures.
  • Price-to-Rent CPI Ratio: This is a common metric in housing market analysis, often used as an indicator of affordability or investment potential. I calculated this on a regional basis.
  • Log Transformations: Applying a log transformation to median_price and rent_cpi can help stabilise variance and linearise relationships, which is often beneficial for modelling.
  • Lagged Values and Growth Rates: To capture market dynamics, I calculated month-on-month growth rates for both median property prices and Rent CPI. This required creating lagged versions of these variables first. It was really important to do these calculations after grouping by region and arranging by date to ensure the lags and growth rates were correctly computed for each region’s specific time series.
  • Rolling Averages: I also calculated 3-month rolling averages for Rent CPI and median prices to smooth out some of the month-to-month volatility and highlight underlying trends.

After creating these new features, especially the lagged and rolling ones, some initial rows for each region would have NA values (e.g., you can’t have a 3-month rolling average for the first two months of data). I filtered these out to ensure the dataset was clean for the subsequent analysis.

# Now I'm creating those new features I talked about.
full_data <- full_data %>%
  group_by(region) %>% # Grouping by region is crucial for lag and rolling calculations
  arrange(date) %>%    # Making sure data is sorted by date within each region first
  mutate(
    # National dwelling ratios (these are from national data, so they'll be the same for all regions in a given quarter)
    rented_ratio_national = rented / total,
    owner_ratio_national = owner_occupied / total,
    free_ratio_national = provided_free / total,
    
    # Regional Price-to-Rent CPI ratio. Checking for zeros or NAs to avoid errors.
    price_rent_cpi_ratio = ifelse(rent_cpi > 0 & !is.na(median_price) & median_price > 0, median_price / rent_cpi, NA),
    
    # Log transformations for potentially skewed variables.
    log_median_price = ifelse(!is.na(median_price) & median_price > 0, log(median_price), NA),
    log_rent_cpi = ifelse(!is.na(rent_cpi) & rent_cpi > 0, log(rent_cpi), NA),
    
    # Lagged values (previous month's value) for calculating growth rates.
    median_price_lag = lag(median_price, 1), 
    rent_cpi_lag = lag(rent_cpi, 1),         
    
    # Monthly growth rates - (current - previous) / previous.
    price_growth_monthly = ifelse(!is.na(median_price_lag) & median_price_lag != 0, (median_price - median_price_lag) / median_price_lag, NA),
    rent_growth_monthly = ifelse(!is.na(rent_cpi_lag) & rent_cpi_lag != 0, (rent_cpi - rent_cpi_lag) / rent_cpi_lag, NA),
    
    # 3-month rolling averages to smooth out short-term fluctuations.
    # align = "right" means the average is calculated using the current and past two periods.
    rent_cpi_roll_avg3 = rollmean(rent_cpi, 3, fill = NA, align = "right", na.rm = TRUE), 
    median_price_roll_avg3 = rollmean(median_price, 3, fill = NA, align = "right", na.rm = TRUE) 
  ) %>%
  ungroup() %>% # Important to ungroup after grouped mutations.
  # Removing initial rows that would have NAs due to lag/rollmean calculations.
  # Using month(2) filter ensures that for each group, at least 2 prior records were available for the 3-month rolling mean.
  filter(date >= (min(full_data$date[which(!is.na(full_data$owner_occupied))], na.rm=TRUE) %m+% months(2))) 

print(paste("Final full_data rows after Feature Engineering and initial NA filter:", nrow(full_data)))
## [1] "Final full_data rows after Feature Engineering and initial NA filter: 972"
# Just having a quick look at some of the new features to make sure they look right.
print("Sample of full_data with some engineered features:")
## [1] "Sample of full_data with some engineered features:"
print(head(select(full_data, date, region, rent_cpi, median_price, price_growth_monthly, rent_growth_monthly, price_rent_cpi_ratio, rent_cpi_lag)))
## # A tibble: 6 × 8
##   date       region     rent_cpi median_price price_growth_monthly rent_growth_monthly price_rent_cpi_ratio rent_cpi_lag
##   <date>     <chr>         <dbl>        <dbl>                <dbl>               <dbl>                <dbl>        <dbl>
## 1 2007-01-01 Auckland       1007      419500             -0.0129                0.0121                 417.          995
## 2 2007-01-01 Wellington     1010      350000             -0.0278                0.0423                 347.          969
## 3 2007-01-01 Rest of N…     1003      280571.             0.0345                0.0277                 280.          976
## 4 2007-01-01 Canterbury     1023      294800             -0.000678             -0.0173                 288.         1041
## 5 2007-01-01 Rest of S…     1052      257333.             0.0114                0.0746                 245.          979
## 6 2007-01-01 National       1025      325500             -0.0136                0.0333                 318.          992

These engineered features should give me more angles to analyse the data from in the EDA, and hopefully provide some stronger signals for the machine learning models later on.

5. Exploratory Data Analysis (EDA)

With my full_data table prepared and features engineered, I was ready to dive into some Exploratory Data Analysis. My main goal here was to visually inspect the data, look for trends, patterns, relationships, and any potential outliers or issues that might affect my later modelling. I planned to look at time series trends, distributions of key variables, correlations, and specific comparisons between metrics like rent, price, and dwelling types.

First, I’ll make sure my global plotting theme is set. I want my plots to have a consistent, professional look with a light grey background for better readability.

5.1. Rent CPI Trend by Broad Region

I started my exploratory data analysis by looking at how the Rent Consumer Price Index (Rent CPI) has changed over time. My aim here was to get a general feel for the rental market trends across the different broad regions I’d defined. I wanted to see the overall direction of these trends and quickly identify if any regions behaved noticeably differently from the others or the national average.

# Plot 1: Rent CPI Trend 
# Taking the 'full_data' and filtering for the key regions I defined,
# and ensuring there's actual rent_cpi data to plot.
plot1_data <- full_data %>% 
  filter(region %in% key_regions_focus & !is.na(rent_cpi))

if (nrow(plot1_data) > 0) {
  plot1 <- ggplot(plot1_data, aes(x = date, y = rent_cpi, color = region, group = region)) +
    geom_line(alpha = 0.9, linewidth = 1) + # Linewidth at 1 for clear lines
    # Using a nice colour palette from RColorBrewer. 'Region:' for the legend title.
    scale_colour_brewer(palette = "Dark2", name = "Region:") +
    # Formatting the y-axis to show numbers with commas for better readability.
    scale_y_continuous(labels = comma) +
    labs(
      title = "Monthly Rent CPI Over Time by Broad Region",
      x = "Date", 
      y = "Rent CPI"
    ) +
    # Making sure the legend is in a single row if it's at the bottom (as per my global theme).
    guides(color = guide_legend(nrow = 1)) 
  
  # This plot will use the global theme_set I defined earlier, 
  # so it should have the light grey background and professional styling.
  print(plot1)
} else { 
  # Just in case there's no data for some reason after filtering.
  print("Plot 5.1 (Original Plot 1): No data available after filtering.")
}
Plot 5.1: Monthly Rent CPI Over Time by Broad Region.

Plot 5.1: Monthly Rent CPI Over Time by Broad Region.

My Insights from Plot 1:

Looking at this first plot, it’s quite clear that there’s been a general upward trend in Rent CPI across all the broad regions I am looking at. Auckland consistently shows a higher CPI level compared to other regions throughout the entire period. The “National” trend line, as I would expect, sits somewhere in the middle, acting like an average. It’s also interesting to see how the “Rest of North Island” and “Rest of South Island” track fairly closely to each other, generally sitting below the main centers like Auckland and Wellington. This gives me a good baseline understanding of rental cost movements.

5.2. Median Property Price Trend by Broad Region

Following a similar approach to the Rent CPI, I then plotted the time series for Median Property Prices across the same broad regions. I was particularly interested to see if property prices showed more volatility or different growth patterns compared to rents. It’s also important to remember that for the “Rest of North Island” and “Rest of South Island” categories, these median prices are an average I calculated from several constituent locations. The “National” trend is based on the “NZ total” from the original property data.

# Plot 2: Median Property Price Trend 
# Again, filtering for my key regions and ensuring there's median_price data.
plot2_data <- full_data %>% 
  filter(region %in% key_regions_focus & !is.na(median_price))

if (nrow(plot2_data) > 0) {
  plot2 <- ggplot(plot2_data, aes(x = date, y = median_price, color = region, group = region)) +
    geom_line(alpha = 0.9, linewidth = 1) + # Consistent line styling
    # Using a different colour palette ("Set1") to distinguish from the Rent CPI plot.
    scale_colour_brewer(palette = "Set1", name = "Region:") +
    # Formatting the y-axis to show prices in thousands of dollars for better readability.
    scale_y_continuous(labels = dollar_format(prefix = "$", scale = 1/1000, suffix = "K")) +
    labs(
      title = "Monthly Median Property Price by Broad Region",
      subtitle = "'Rest of...' regions use averaged prices. 'National' uses 'NZ total'.",
      x = "Date", 
      y = "Median Property Price"
    ) +
    guides(color = guide_legend(nrow = 1)) # Ensuring legend is neat.
  
  # This plot also uses the global theme_set established earlier for the grey background and general styling.
  print(plot2)
} else { 
  print("Plot 5.2 (Original Plot 2): No data available after filtering.")
}
Plot 5.2: Monthly Median Property Price by Broad Region.

Plot 5.2: Monthly Median Property Price by Broad Region.

My Insights from Plot 2:

Just as I suspected, property prices show much more dramatic changes and volatility. Auckland’s median price, in particular, has seen very significant increases, far outpacing other regions and the national average, especially in certain periods. The other regions also trend upwards but at different rates and with different peaks and troughs.

5.3. Correlation Matrix of Key Variables

After looking at how Rent CPI and Median Property Prices changed over time, I wanted a quick way to see if there were any straight-line relationships between some of my main numbers. A correlation matrix is pretty handy for this. It shows a grid where each cell tells you how strongly two variables move together – a positive number means they tend to go up or down together, a negative number means one goes up as the other goes down, and a number close to zero means not much of a straight-line link.

I picked some of the original metrics like rent_cpi and median_price, plus some of the new ones I made, like the price_rent_cpi_ratio and the monthly growth rates.

# Plot 3: Correlation Matrix
# Note: corrplot is not a ggplot2 based plot, so my global theme_set for grey backgrounds won't automatically apply here.
# However, I have added a 'bg = "grey95"' argument directly to the corrplot function
# to try and keep the look somewhat consistent with the other plots.
numeric_cols_for_corr <- c("rent_cpi", "median_price", "rented_ratio_national", "owner_ratio_national", 
                           "price_rent_cpi_ratio", "price_growth_monthly", "rent_growth_monthly")
correlation_data <- full_data %>% 
  select(all_of(numeric_cols_for_corr)) %>% 
  drop_na() # Correlation can only be calculated on complete pairs of data.

if(nrow(correlation_data) > 1 && ncol(correlation_data) > 1) {
  cor_matrix <- cor(correlation_data) # Calculating the correlation coefficients
  
  # Now plotting the matrix.
  # 'method = "color"' shows correlations with colours and shades.
  # 'type = "upper"' just shows the upper triangle to avoid repetition.
  # 'order = "hclust"' reorders the variables to group similar ones together.
  # 'addCoef.col' puts the actual correlation numbers on the plot.
  corrplot(cor_matrix, 
           method = "color", 
           type = "upper", 
           order = "hclust",
           addCoef.col = "black", 
           tl.col = "black",               # Text label colour
           tl.srt = 45,                  # Text label string rotation
           tl.cex = 0.8,                 # Text label character expansion (size)
           number.cex = 0.7,             # Correlation coefficient number size
           diag = FALSE,                 # Don't show the diagonal
           col = brewer.pal(n=10, name="RdYlBu"), # A nice colour palette
           bg = "grey95")                # Setting a light grey background for the plot panel
  
  # Adding a title manually as corrplot doesn't integrate with ggplot theming for titles easily.
  title("Correlation Matrix of Key Housing Variables (Monthly Data)", line = 3, cex.main=1.1, col.main="grey10")
} else { 
  print("Plot 5.3 (Original Plot 3): Not enough data for correlation plot after dropping NAs from selected columns.") 
}
Plot 5.3: Correlation Matrix of Key Housing Variables.

Plot 5.3: Correlation Matrix of Key Housing Variables.

My Insights from Plot 3:

This matrix is quite useful. Straight away, I can see that rent_cpi and median_price have a pretty strong positive correlation (the blue colour and the number (around 0.7 or 0.8, depending on the exact data slice) show this). This means they generally tend to trend in the same direction. The price_rent_cpi_ratio is, as you would expect, strongly linked to median_price (because it’s part of the calculation) and has a weaker link with rent_cpi. The monthly growth rates (price_growth_monthly, rent_growth_monthly) don’t show super strong correlations with the absolute levels of prices or rents, which makes sense because growth can be up and down even if overall levels are high or low. This plot is useful for spotting if any of my predictors for the later machine learning models are too highly correlated with each other, which might cause issues.

5.4. Distribution of Rent CPI - Faceted by Broad Region

After looking at the time series trends, I wanted to understand the actual distribution of Rent CPI values for each key region. A histogram overlaid with a density plot, faceted by region, seemed like a good way to visualise this. I used scales = "free" for the facets because the Rent CPI ranges and distributions can be quite different across regions, and I wanted to see the individual distributions clearly.

# Plot 4 in my R script: Distribution of Rent CPI
# Filtering for the key regions and where rent_cpi is not NA.
plot4_data <- full_data %>% 
  filter(region %in% key_regions_focus & !is.na(rent_cpi))

if(nrow(plot4_data) > 0){
  plot4 <- ggplot(plot4_data, aes(x = rent_cpi)) +
    # Histogram shows the counts in different bins. Using density on y-axis to overlay density curve.
    geom_histogram(aes(y = after_stat(density), fill = region), alpha = 0.8, bins = 25, show.legend = FALSE) +
    # Density plot gives a smoother look at the distribution shape.
    geom_density(aes(color = region), linewidth = 0.9, show.legend = FALSE) +
    # Using different colour palettes for fill and the line for clarity.
    scale_fill_brewer(palette = "Pastel1") + 
    scale_color_brewer(palette = "Set1") +
    # Faceting by region, with free scales so each plot adapts to its own data range.
    facet_wrap(~region, scales = "free") + 
    labs(
      title = "Distribution of Rent CPI by Broad Region", 
      x = "Rent CPI", 
      y = "Density"
    ) +
    theme(strip.text = element_text(size=rel(0.85))) # Adjusting facet label size.
  
  # This plot uses the global theme_set.
  print(plot4)
} else { 
  print("Plot 5.4 (Original Plot 4): No data available after filtering.")
}
Plot 5.4: Distribution of Rent CPI by Broad Region.

Plot 5.4: Distribution of Rent CPI by Broad Region.

Insights:

The analysis explores how the Rent Consumer Price Index (CPI) is distributed across various New Zealand regions using histograms and kernel density estimations. Rent market behavior differs significantly by region.

Auckland:

  • The Rent CPI shows multimodality, meaning several peaks are present.
  • There is high dispersion, indicating a wide range of rent index values.
  • Suggests multiple factors influencing rental price variation.

Canterbury:

  • Displays a unimodal pattern with a clear peak.
  • The mode is around 1350–1400 CPI.
  • Lower variance than Auckland, indicating more uniform rent changes across the region.

Wellington:

  • Strong mode around 1050–1150 CPI, showing many lower rent index increases.
  • A second, weaker mode at higher CPI values indicates a minority with larger increases.

Rest of North Island:

  • The data is positively skewed, with a long tail to the right.
  • Main mode lies between 1050–1150 CPI.
  • Indicates generally small rent index increases.

Rest of South Island:

  • Largely unimodal with central tendency between 1100–1200 CPI.
  • Medium dispersion suggests relatively stable rental market behavior.

National:

  • Shows a unimodal shape with the main mode around 1100–1200 CPI.
  • Represents combined data from all regions with medium dispersion.
  • Useful for comparing central tendencies and spread across regions.

5.5. Distribution of Median Property Prices - Faceted by Broad Region

Similar to what I did for Rent CPI, I wanted to see the distribution of median property prices across the regions. I was expecting these to be even more skewed than the Rent CPI, especially for regions like Auckland, because property prices have seen some really big jumps as shown in the earlier time series plots. Using scales = "free" on the facets is essential here due to the wide variation in price levels between regions.

# Plot 5 in your R script: Distribution of Median Property Prices
plot5_data <- full_data %>% 
  filter(region %in% key_regions_focus & !is.na(median_price))

if(nrow(plot5_data) > 0){
  plot5 <- ggplot(plot5_data, aes(x = median_price)) +
    geom_histogram(aes(y = after_stat(density), fill = region), alpha = 0.8, bins = 25, show.legend = FALSE) + # Using after_stat()
    geom_density(aes(color = region), linewidth = 0.9, show.legend = FALSE) +
    scale_fill_brewer(palette = "Pastel2") + 
    scale_color_brewer(palette = "Set2") +
    # Formatting x-axis as thousands of dollars for better readability.
    scale_x_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) + 
    facet_wrap(~region, scales = "free") + # Using 'free' scales as price ranges vary a lot by region
    labs(
      title = "Distribution of Median Prices by Broad Region", 
      # Adding ($K) to x-axis label for clarity as per scale_x_continuous
      x = "Median Price ($K)", 
      y = "Density"
    ) +
    theme(strip.text = element_text(size=rel(0.85)))
  print(plot5)
} else { 
  print("Plot 5.5 (Original Plot 5): No data after filtering.")
}
Plot 5.5: Distribution of Median Prices by Broad Region.

Plot 5.5: Distribution of Median Prices by Broad Region.

Insights:

Auckland

The Auckland median property price data shows clear multimodality, with at least two main peaks. One mode (common price range) is around $500K-$600K and another, more prominent mode is near $800K-$900K. The data has a wide dispersion, meaning prices are very spread out, reflecting a diverse and segmented property market.

Canterbury

Canterbury’s median property price data appears bimodal (two main peaks). One mode is around $350K, and a more significant mode is visible around the $450K mark. This suggests two common price clusters in the Canterbury region’s property market.

Wellington Wellington’s median property price data appears broadly unimodal, with its main mode or central tendency around $400K-$500K. The distribution is quite spread out, with a tail extending towards higher prices, indicating some variability but a clear concentration in that mid-range.

Rest of North Island

This region’s median property price data is strongly unimodal and positively skewed (the tail of the data goes to the right, towards higher prices). The mode is clearly at the lower end of the price scale, around $275K-$300K. This indicates that most properties in these areas have lower median prices.

Rest of South Island

Similar to the Rest of North Island, the median property price data here is also strongly unimodal and positively skewed. The mode is also concentrated at the lower end, around $275K-$300K. This suggests that median property prices in these parts of the South Island are generally lower.

National

The median property price data for the whole country (National) is broadly spread, possibly unimodal but with a wide base, or slightly multimodal. The main concentration of prices seems to be between $350K and $500K, with a tail showing some much higher prices. This reflects the average of all diverse regional markets.

5.6. Rent CPI vs. Median Property Price - Faceted by Region

After looking at their individual distributions and time trends, I wanted to directly compare Rent CPI against Median Property Prices for each region. My idea was to see if regions with higher property prices also tend to have higher rents, and how tightly these two are linked. I’ve added a linear regression line (the dashed line) to each plot to show the general trend, and also included the Pearson’s R-squared value and p-value to get a statistical idea of the strength and significance of this linear relationship.

# Plot 6 in your R script: Rent CPI vs Median Price, with correlation
plot6_data <- full_data %>% 
  filter(region %in% key_regions_focus & !is.na(median_price) & !is.na(rent_cpi))

if(nrow(plot6_data) > 0){
  plot6 <- ggplot(plot6_data, aes(x = median_price, y = rent_cpi)) +
    # Points are slightly transparent and coloured by region (though faceting also separates them).
    geom_point(aes(color = region), alpha = 0.4, show.legend = FALSE, size=1) + 
    # Adding a linear model smooth line without confidence interval for a cleaner look.
    geom_smooth(method = "lm", se = FALSE, color = "black", linetype = "dashed", linewidth=0.6) +
    # Using ggpubr to add R-squared and p-value directly onto the plot facets.
    # Using after_stat() for labels as per newer ggplot2/ggpubr versions.
    # Customising the label to show R^2 and a more readable p-value.
    ggpubr::stat_cor(
      aes(label = paste(after_stat(rr.label), 
                        if_else(readr::parse_number(after_stat(p.label)) < 0.001, 
                                "p < 0.001", # Displaying very small p-values nicely
                                paste("p =", scales::pvalue(readr::parse_number(after_stat(p.label)), accuracy=0.001, add_p = TRUE)) # Formatting p-value
                               ), 
                        sep = "~`,`~")), # Comma separation for R^2 and p
      method = "pearson", 
      label.x.npc = "left", label.y.npc = "top", # Positioning the stats text
      r.digits = 2, p.digits = 3, # Number of digits for R-squared and p-value
      size = 3.0, # Text size for the stats
      colour = "grey10" # Text colour
    ) +
    scale_x_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) +
    scale_color_brewer(palette = "Dark2") + # Colour palette for points (if shown without faceting)
    facet_wrap(~region, scales = "free") + # 'free' scales are crucial here for both axes
    labs(
      title = "Rent CPI vs. Median Property Price", 
      subtitle = "Relationship by Broad Region, with Pearson's R-squared and p-value", 
      x = "Median Property Price ($K)", 
      y = "Rent CPI"
    ) +
    theme(strip.text = element_text(size=rel(0.85))) # Adjusting facet label size
  print(plot6)
} else { 
  print("Plot 5.6 (Original Plot 6): No data after filtering.")
}
Plot 5.6: Rent CPI vs. Median Property Price by Broad Region, with Correlation Statistics.

Plot 5.6: Rent CPI vs. Median Property Price by Broad Region, with Correlation Statistics.

Insights:

Auckland

In Auckland, there is a strong positive linear relationship between Median Property Price and Rent CPI, with a Pearson correlation coefficient (R) of 0.97. This high R-value indicates that as median property prices increase, Rent CPI also tends to increase in a very consistent, linear way. The data points are tightly clustered around the positive sloping trend line.

Canterbury

Canterbury also shows a strong positive linear relationship, with an R-value of 0.92. While still a very strong correlation, it’s slightly less than Auckland’s. This means that higher median property prices are strongly associated with higher Rent CPI, and the data points follow the positive trend line closely.

Wellington

Wellington also demonstrates a strong positive linear relationship between Median Property Price and Rent CPI, with an R-value of 0.96. This high correlation suggests that increases in median property prices are closely linked with increases in Rent CPI, following a clear linear pattern.

Rest of North Island

This region exhibits a strong positive linear relationship, indicated by an R-value of 0.96. This high correlation means there’s a consistent tendency for Rent CPI to be higher when median property prices are higher. The data points align closely with the upward sloping trend line.

Rest of South Island

Similar to other regions, the Rest of South Island shows a strong positive linear relationship, with an R-value of 0.94. This indicates a strong association where higher median property prices correspond with higher Rent CPI. The points are generally well-aligned with the positive trend.

National

For the entire country (National), the correlation between Median Property Price and Rent CPI is extremely strong and positive, with an R-value of 0.98. This suggests a very consistent linear trend nationwide: as property prices go up, rent CPIs also tend to go up. The scatter of points is minimal around the trend line.

5.7. National Dwelling Composition Over Time

Next, I wanted to get a sense of the broader housing situation in New Zealand by looking at the national trends in dwelling tenure. I was curious to see how the proportions of rented, owner-occupied, and freely provided dwellings have changed over the years. This provides important context for the rental market. I used a stacked area chart to show how these different tenure types contribute to the total housing stock over time.

# Plot 7 in my R script: National Dwelling Composition - Stacked Area
# These ratios (rented_ratio_national, etc.) are based on national dwelling data, 
# so they are the same for all regions at a given point in time.
# I only need one set of these national ratios per date for plotting.
national_dwelling_ratios_long <- full_data %>%
  filter(!is.na(rented_ratio_national) & !is.na(owner_ratio_national) & !is.na(free_ratio_national)) %>%
  # distinct() keeps one row per date, effectively giving me the national trend.
  distinct(date, .keep_all = TRUE) %>% 
  # Renaming for prettier legend labels in the plot.
  select(date, 
         `Rented` = rented_ratio_national, 
         `Owner Occupied` = owner_ratio_national, 
         `Provided Free` = free_ratio_national) %>%
  # Pivoting the data longer to make it suitable for a stacked area chart with ggplot.
  pivot_longer(cols = -date, names_to = "dwelling_type", values_to = "ratio") 

if(nrow(national_dwelling_ratios_long) > 0){
  plot7 <- ggplot(national_dwelling_ratios_long, aes(x = date, y = ratio, fill = dwelling_type)) +
    geom_area(alpha = 0.85, position = "stack") + # Stacked area chart to show proportions.
    scale_y_continuous(labels = percent_format()) + # Y-axis as percentage.
    scale_fill_brewer(palette = "Accent", name = "Dwelling Type:") + # A nice colour palette for fills.
    labs(
      title = "National Dwelling Composition Over Time", 
      x = "Date", 
      y = "Proportion of Total Dwellings"
    )
    # This plot uses the global theme_set I defined earlier.
  print(plot7)
} else { 
  print("Plot 5.7 (Original Plot 7): No data available after filtering.")
}
Plot 5.7: National Dwelling Composition Over Time.

Plot 5.7: National Dwelling Composition Over Time.

Insights:

Owner Occupied Dwellings:

The proportion of owner-occupied dwellings (the green area) consistently represents the largest share of total dwellings throughout the observed period. This stratum appears relatively stable over time, though there might be a very slight, gradual decrease. It consistently accounts for more than 60% of all dwellings, indicating that outright ownership or mortgage-based ownership is the dominant tenure type.

Rented Dwellings:

The proportion of rented dwellings (the orange area) is the second-largest component of the national housing stock. There appears to be a slight but noticeable increasing trend in this proportion over the years shown. Starting from around 30%, the ratio of rented dwellings seems to have gradually risen, suggesting a growing rental sector relative to the total housing.

Provided Free Dwellings:

Dwellings Provided Free (the purple area) constitute the smallest proportion of the three tenure types. This category remains consistently small and relatively stable across the time series, representing a low single-digit percentage (likely around 5-7%) of the total dwellings. Its contribution to the overall housing composition shows minimal fluctuation.

5.8. Distribution of Monthly Property Price Growth by Year

After looking at the overall trends and distributions of actual price and rent levels, I wanted to dig into their growth rates. For property prices, I decided to use boxplots to see the distribution of monthly price growth for each year, and I did this separately for each broad region using facets. This can show me which years had particularly high or low growth, how much growth varied within each year (the size of the box and whiskers), and if there were many outlier months.

# Plot 8 in my R script: Property Price Growth (Monthly) - Boxplot by Year, Faceted
# I'm filtering for my key regions and making sure price_growth_monthly is a valid number.
plot8_data <- full_data %>%
  filter(region %in% key_regions_focus & !is.na(price_growth_monthly) & is.finite(price_growth_monthly)) %>%
  mutate(year_factor = factor(year)) # Treating 'year' as a factor for the x-axis categories.

if(nrow(plot8_data) > 0){
  plot8 <- ggplot(plot8_data, aes(x = year_factor, y = price_growth_monthly, fill = region)) +
    # Boxplots are great for summarising distributions.
    # I am not showing the legend for 'fill' here because the facets already separate regions.
    geom_boxplot(show.legend = FALSE, outlier.alpha = 0.3, outlier.size=0.8, linewidth=0.3) + 
    # A horizontal line at zero helps see positive vs. negative growth months.
    geom_hline(yintercept = 0, linetype = "dotted", color = "red4", linewidth=0.6) +
    # Formatting the y-axis as percentages.
    scale_y_continuous(labels = percent_format(accuracy = 1L)) + # Changed accuracy for potentially small percentages
    scale_fill_brewer(palette = "Pastel2") + # A nice soft colour palette.
    facet_wrap(~region, scales = "free_y", ncol = 3) + # Free y-scales because growth ranges might differ by region.
    labs(
      title = "Distribution of Monthly Property Price Growth by Year", 
      subtitle="Faceted by Broad Region", 
      x = "Year", 
      y = "Monthly Price Growth (%)" # Added units to y-axis label
    ) +
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size=rel(0.8)), # Angling x-axis text for readability.
      strip.text = element_text(size=rel(0.85)) # Adjusting facet label size.
    )
  print(plot8)
} else { 
  print("Plot 5.8 (Original Plot 8): No data available after filtering.")
}
Plot 5.8: Distribution of Monthly Property Price Growth by Year, Faceted by Broad Region.

Plot 5.8: Distribution of Monthly Property Price Growth by Year, Faceted by Broad Region.

Insights

  1. Auckland

    • Median monthly growth rates (the line in the middle of the box) fluctuate year by year, often hovering near or slightly above 0%. Some years (e.g., around 2013–2016) show consistently positive median growth.
    • The Interquartile Range (IQR), shown by the height of the boxes, indicates varying levels of volatility (spread) in growth rates within each year. Some years have tighter boxes (less spread), others are wider.
    • Numerous outliers (dots above/below the whiskers) are visible in most years, indicating months with unusually high or low growth compared to the typical range for that year.
    • There are periods of sustained positive median growth and periods where medians are closer to zero or slightly negative.
  2. Canterbury

    • Median monthly growth rates in Canterbury also vary annually. There are distinct periods of positive median growth (e.g., around 2012–2015) and periods where medians are closer to zero.
    • The IQR changes from year to year, showing shifts in the consistency of monthly growth.
    • Outliers are present, particularly on the upside in some years, suggesting occasional months of very strong growth.
    • The overall pattern shows cycles of higher and lower typical monthly growth.
  3. Wellington

    • Wellington’s median monthly growth rates vary, with some years showing positive medians (e.g., around 2015–2017, 2019–2020) and others hovering near zero or slightly negative.
    • The IQR and presence of outliers indicate variability in monthly growth within each year. Note the y-axis scale is wider here (up to 10%, down to –10%), suggesting some months experienced quite large positive or negative growth.
    • Certain years, like 2020, show a higher median and a wider spread, indicating both stronger typical growth and more volatility.
  4. Rest of North Island

    • Median monthly growth rates are often close to 0%, but with some years showing slightly positive medians (e.g., 2013–2016, 2019–2020).
    • The IQR suggests moderate volatility in monthly growth for most years.
    • Some outliers are present, but perhaps fewer extreme ones compared to Auckland or National.
    • The pattern suggests relatively stable but modest growth for many periods.
  5. Rest of South Island

    • This region shows median monthly growth rates that fluctuate around 0%, with some years experiencing slightly negative medians (e.g., 2008–2009) and others slightly positive.
    • The dispersion of growth rates within years (box height) appears relatively consistent across many years.
    • Outliers are visible, indicating some months with growth rates outside the typical range.
    • The overall impression is of relatively low and sometimes negative median monthly growth, with periods of modest positive growth.
  6. National

    • The national view shows median monthly growth generally positive across many years, though with fluctuations. Years like 2008–2009 show medians closer to or below zero, reflecting a downturn.
    • The dispersion (spread) of growth rates within each year (box height) is evident, with some years showing more variability than others.
    • Outliers, both positive and negative, are visible, highlighting months that deviated significantly from the typical growth for their respective years.
    • The y-axis scale here goes up to 10% and down to –5%, indicating a wider range of national growth experiences compared to some individual regions shown.

5.9. Monthly Rent CPI Growth Over Time - Faceted

After looking at the distributions of property price growth, I wanted to see a similar view for Rent CPI growth. Instead of boxplots by year, for this one I chose to plot the monthly Rent CPI growth rate as a time series line for each region. Faceting by region helps to see each trend clearly, and I’ve added a horizontal line at 0% to easily distinguish periods of rent growth from rent decline.

# Plot 9 in my R script: Rent CPI Growth (Monthly) - Line plot, Faceted
# Filtering for my key regions and ensuring rent_growth_monthly is a valid number.
plot9_data <- full_data %>% 
  filter(region %in% key_regions_focus & !is.na(rent_growth_monthly) & is.finite(rent_growth_monthly))

if(nrow(plot9_data) > 0){
  plot9 <- ggplot(plot9_data, aes(x = date, y = rent_growth_monthly, color = region, group = region)) +
    # Using geom_line to see the trend over time.
    # Since I am faceting by region and also colouring by region, the legend isn't strictly necessary here.
    geom_line(alpha = 0.8, linewidth=0.9, show.legend = FALSE) + 
    # A line at y=0 helps to quickly see positive vs. negative growth.
    geom_hline(yintercept = 0, linetype = "dashed", color = "grey30", linewidth=0.6) +
    # Formatting y-axis as percentages.
    scale_y_continuous(labels = percent_format(accuracy = 0.1)) +
    # Assigning colours to regions, though they are mostly distinguished by facets here.
    scale_color_brewer(palette = "Set2", name="Region:") + 
    labs(
      title = "Monthly Rent CPI Growth Over Time", 
      subtitle="Faceted by Broad Region", 
      x = "Date", 
      y = "Monthly Rent CPI Growth (%)" # Added units
    ) +
    facet_wrap(~region, scales="free_y", ncol=2) + # Free y-scales for each region's growth range.
    # Hiding the legend as the facets make it clear which region is which.
    theme(legend.position = "none", strip.text = element_text(size=rel(0.85))) 
  
  # This plot uses the global theme_set.
  print(plot9)
} else { 
  print("Plot 5.9 (Original Plot 9): No data available after filtering.")
}
Plot 5.9: Monthly Rent CPI Growth Over Time, Faceted by Broad Region.

Plot 5.9: Monthly Rent CPI Growth Over Time, Faceted by Broad Region.

Insights

Auckland:

Monthly Rent CPI growth in Auckland shows consistent volatility, with frequent fluctuations typically ranging between -1.0% and +2.0%. There are no very long, sustained periods of exceptionally high or low growth, but rather a persistent pattern of small monthly increases and occasional small decreases. The mean growth appears to be slightly positive over the entire period.

Canterbury:

Canterbury’s monthly Rent CPI growth also exhibits considerable volatility. The typical range of fluctuations appears to be roughly between -2.5% and +2.5%. There are noticeable spikes, both positive and negative, throughout the time series. For instance, there are some sharper positive growth spikes visible around 2012–2014, and also some distinct negative spikes at other times.

Wellington:

Wellington’s monthly Rent CPI growth is also quite volatile, with fluctuations often ranging between -2.5% and +2.5%, but with notable outlier spikes reaching up to +5.0% (e.g., around 2008) and down to nearly -5.0% (e.g., around 2009 and late 2019/early 2020). The time series shows periods of more clustered positive growth and other periods with more negative dips.

Rest of North Island:

Monthly Rent CPI growth in the Rest of the North Island is also volatile, generally fluctuating between approximately -2.0% and +2.0%. The amplitude of these fluctuations seems relatively consistent over the observed period. There are no obvious long-term shifts in the average growth rate, which hovers around zero, perhaps slightly positive.

Rest of South Island:

This region shows the most dramatic volatility in monthly Rent CPI growth, with a y-axis scale ranging from -5.0% to +10.0%. There are several very large positive spikes (e.g., approaching 10% around 2008 and another significant one around 2011–2012) and also some sharp negative spikes. This indicates a much less stable pattern of monthly rent changes compared to other regions. The variance in growth rates is clearly higher here.

National:

The national monthly Rent CPI growth displays a time series pattern that averages out some of the more extreme regional fluctuations, but still shows clear volatility. Growth rates generally move between -1.0% and +2.0%, with occasional spikes beyond this range (e.g., a positive spike near 3% around 2008 and another around 2016). The overall trend seems to be centered slightly above zero.

5.10. Heatmap of Average Price-to-Rent CPI Ratio

To get a different view of how property prices relate to rents, I decided to calculate the average Price-to-Rent CPI ratio for each region, each year. This analysis looks at the average Price-to-Rent CPI ratio for different New Zealand regions from 2007 to 2020. A higher ratio generally means properties are relatively more expensive compared to the cost of renting in that area at that time. I thought a heatmap would be a good way to visualise this, as it can quickly show which regions and years had higher or lower ratios. I excluded the “National” figures here to get a better colour scale for the regional variations.

# Plot 10 in my R script: Price-to-Rent CPI Ratio Heatmap by Region and Year
# First, I need to calculate the average price_rent_cpi_ratio for each region and year.
# I'm filtering out any NA or non-finite values and also excluding the "National" region
# to let the regional differences show up better in the colour scale of the heatmap.
heatmap_data_price_rent_ratio <- full_data %>%
  filter(!is.na(price_rent_cpi_ratio) & is.finite(price_rent_cpi_ratio) & region != "National") %>%
  group_by(year, region) %>%
  summarise(avg_price_rent_cpi_ratio = mean(price_rent_cpi_ratio, na.rm = TRUE), .groups = "drop")

if(nrow(heatmap_data_price_rent_ratio) > 0){
  plot10 <- ggplot(heatmap_data_price_rent_ratio, 
                   aes(x = factor(year), # Treating year as a categorical factor for the x-axis
                       # Reordering regions on the y-axis based on their median ratio, highest first.
                       y = fct_reorder(region, avg_price_rent_cpi_ratio, .fun = median, .desc = TRUE, na.rm=TRUE), 
                       fill = avg_price_rent_cpi_ratio)) + # Colour intensity based on the ratio
    geom_tile(color = "white", linewidth=0.2) + # geom_tile creates the heatmap cells, with white borders.
    # Using the 'plasma' viridis colour scale, which is good for continuous data and colourblind-friendly.
    # Reversing direction so higher values are "hotter" (depends on palette choice).
    scale_fill_viridis_c(option = "plasma", name = "Avg. Price/\nRent CPI Ratio", labels = comma, direction = -1) +
    labs(
      title = "Heatmap: Average Price-to-Rent CPI Ratio", 
      subtitle="Excluding 'National'. 'Rest of...' regions use averaged prices.", 
      x = "Year", 
      y = "Region"
    ) +
    # Customising the theme a bit for a cleaner heatmap look.
    theme(
      axis.text.x = element_text(angle = 45, hjust = 1, size=rel(0.9)), # Angling year labels.
      legend.position = "right", 
      panel.grid = element_blank(), # Removing grid lines.
      panel.border = element_blank()  # Removing panel border.
    )
    # This plot will inherit the overall plot.background from theme_set.
  print(plot10)
} else { 
  print("Plot 5.10 (Original Plot 10): No data available after filtering for the heatmap.")
}
Plot 5.10: Heatmap of Average Price-to-Rent CPI Ratio by Region and Year.

Plot 5.10: Heatmap of Average Price-to-Rent CPI Ratio by Region and Year.

Insights

Auckland:
Auckland consistently displays the highest Price-to-Rent CPI ratios across all years, shown by the persistent dark purple/blue colors. There’s a clear upward trend from around 2012 onward, with the ratio peaking between 2016 and 2020. This suggests that property values in Auckland have increasingly outpaced rent levels, making it the most expensive region relative to rental returns.

Wellington:
Wellington generally has the second-highest Price-to-Rent CPI ratios, though still lower than Auckland. The region mostly shows orange to reddish-purple tones, indicating moderate to high ratios. The trend becomes more noticeable from around 2015, with ratios rising steadily into the later years, reflecting growing property prices relative to rents.

Canterbury:
Canterbury’s ratios fall in the mid-range compared to other regions, represented by orange and yellowish-orange shades. The pattern remains relatively stable over time, with a modest increase between 2013 and 2015, followed by a plateau or slight decline, then a small upward movement closer to 2020. This indicates moderate property price increases relative to rents.

Rest of North Island:
This region displays consistently lower ratios, indicated by mostly yellow and light orange colors. A gradual increase is visible over time, but the ratios remain significantly lower than those in Auckland or Wellington. This suggests that in the Rest of the North Island, property prices have remained more aligned with rent levels.

Rest of South Island:
The Rest of South Island shows the lowest Price-to-Rent CPI ratios throughout the period, represented by bright yellow tones. There is little change over time, with ratios remaining stable and low. This reflects a more balanced or affordable relationship between property prices and rental values in this region.

5.11. Smoothed Rent CPI (3-Month Rolling Average)

Sometimes the raw monthly data can be a bit noisy, making it harder to see the main trend. To help with this, I calculated a 3-month rolling average for the Rent CPI. Plotting this smoothed line alongside the actual CPI values can give a clearer picture of the underlying direction of rents in each region. I’ve faceted this by region again to compare.

# Plot 11 in my R script: Smoothed Rent CPI (3-month Rolling Avg) - Faceted by Broad region
# Filtering for key regions and ensuring both actual CPI and the rolling average are available.
plot11_data <- full_data %>%
  filter(region %in% key_regions_focus & !is.na(rent_cpi_roll_avg3) & !is.na(rent_cpi))

if(nrow(plot11_data) > 0){
  plot11 <- ggplot(plot11_data, aes(x = date)) +
    # Plotting the actual CPI with a lighter colour and some transparency.
    geom_line(aes(y = rent_cpi, colour = "Actual CPI"), alpha = 0.5, linewidth=0.7) +
    # Overlaying the 3-month rolling average with a more prominent line.
    geom_line(aes(y = rent_cpi_roll_avg3, colour = "3-Month Rolling Avg"), linewidth = 1.1) +
    facet_wrap(~region, scales = "free_y", ncol=3) + # Free y-scales for each region.
    # Setting specific colours for the lines to make them distinct.
    scale_colour_manual(values = c("Actual CPI" = "grey60", "3-Month Rolling Avg" = "steelblue3"), name="Metric:") +
    labs(
      title = "Smoothed Rent CPI (3-Month Rolling Average)", 
      subtitle="Faceted by Broad Region", 
      x = "Date", 
      y = "Rent CPI"
    ) +
    theme(legend.position = "top", strip.text = element_text(size=rel(0.85))) # Legend at the top for clarity.
  
  # This plot uses the global theme_set.
  print(plot11)
} else { 
  print("Plot 5.11 (Original Plot 11): No data available after filtering.")
}
Plot 5.11: Smoothed Rent CPI (3-Month Rolling Average) by Broad Region.

Plot 5.11: Smoothed Rent CPI (3-Month Rolling Average) by Broad Region.

Insights

Auckland:
Auckland’s 3-month rolling average Rent CPI shows a smooth and consistent upward trend, rising from around 1000 in 2007 to over 1500 by early 2020. The smoothed line minimizes monthly volatility and reveals a steady growth pattern with no major periods of sharp acceleration or stagnation.

Canterbury:
Canterbury exhibits a sharp rise in the smoothed Rent CPI from around 2011 to 2015, increasing from about 1100 to over 1400. After this period of rapid growth, the curve flattens significantly, indicating much slower growth or stabilization from 2015 to 2020, with the CPI hovering between 1400–1450.

Wellington:
Wellington’s smoothed Rent CPI follows a clear upward path, starting near 1000–1100 and rising to above 1500 by 2020. The growth appears to accelerate slightly from 2015 onward. The rolling average captures this overall upward shift while reducing the noise in monthly fluctuations.

Rest of North Island:
This region shows a strong and steady upward trend in the smoothed Rent CPI, increasing from around 1000 in late 2007 to about 1600 by early 2020. The growth rate appears consistent throughout the time period.

Rest of South Island:
The smoothed Rent CPI here increases from about 1000 to just over 1400, but the underlying actual CPI line shows much greater volatility. While the rolling average smooths out some of this noise, it still reflects more irregularities compared to other regions, with visible phases of faster and slower growth.

National:
The national 3-month rolling average Rent CPI rises steadily from 1000 to roughly 1500 between 2007 and 2020. The smoothing highlights a clear, nearly linear upward trend across the entire period, masking shorter-term fluctuations.

5.12. Volatility of Monthly Property Price Growth

After looking at the growth rates themselves, I was curious about how stable or volatile those growth rates have been. To measure this, I calculated a 12-month rolling standard deviation of the monthly property price growth for each region. A higher value on this plot means more volatility (bigger swings in the growth rate from month to month over the past year), while a lower value suggests more stable or predictable growth. I’ve faceted this by region to compare.

# Plot 12 in my R script: Volatility of Property Price Growth (12-month Rolling SD)
# Calculating the 12-month rolling standard deviation of price_growth_monthly for each region.
# Using zoo::rollapply for this.
plot12_data <- full_data %>%
  group_by(region) %>% # Calculation needs to be done per region.
  arrange(date) %>%    # Ensuring data is in chronological order for rollapply.
  mutate(
    price_growth_volatility_12m = zoo::rollapply(
      data = price_growth_monthly, 
      width = 12,         # 12-month window
      FUN = sd,           # Standard deviation function
      fill = NA,          # Fill initial periods (where full window not available) with NA
      align = "right",    # Align window to the right (current and past 11 months)
      na.rm = TRUE        # Remove NAs within the window before calculating sd
    )
  ) %>%
  ungroup() %>%
  # Filtering for key regions and valid volatility data.
  filter(region %in% key_regions_focus & !is.na(price_growth_volatility_12m) & is.finite(price_growth_volatility_12m))

if(nrow(plot12_data) > 0){
  plot12 <- ggplot(plot12_data, aes(x = date, y = price_growth_volatility_12m, color = region, group=region)) +
    geom_line(linewidth = 0.9, show.legend=FALSE) + # Legend off as faceting by region.
    # Y-axis as percentage, though it's a standard deviation of percentages.
    scale_y_continuous(labels = percent_format(accuracy=0.1)) +
    scale_color_brewer(palette = "Set1", name="Region:") + 
    labs(
      title = "Monthly Property Price Growth Volatility",
      subtitle="12-Month Rolling Standard Deviation, Faceted by Broad Region",
      x = "Date", 
      y = "Std. Dev of Monthly Price Growth"
    ) +
    facet_wrap(~region, scales="free_y", ncol=2) + # Free y-scales as volatility levels can differ.
    theme(strip.text = element_text(size=rel(0.85))) 
  
  # This plot uses the global theme_set.
  print(plot12)
} else { 
  print("Plot 5.12 (Original Plot 12): No data available after filtering.")
}
Plot 5.12: Volatility of Monthly Property Price Growth (12-Month Rolling Standard Deviation), Faceted by Broad Region.

Plot 5.12: Volatility of Monthly Property Price Growth (12-Month Rolling Standard Deviation), Faceted by Broad Region.

Insights

Auckland:

Auckland’s property price growth volatility fluctuated over time. It was relatively high (around 3.0%–3.5%) during the 2008–2009 period, dropped to about 2.0%–2.5% between 2010–2012, then peaked again around 2013–2014 (~4.5%). After that, volatility remained moderately high, generally moving between 2.5% and 4.0% through to 2020.

Canterbury:

Canterbury began with moderate volatility (~2.0%–2.5%) and saw a noticeable spike around 2012 (~4.0%). This was followed by a decreasing trend until about 2016 (dipping below 2.0%), before volatility increased again, peaking near 4.0% around 2019. These shifts reflect a cyclical volatility pattern.

Wellington:

Wellington had the highest volatility overall, starting around 4.0%–5.0% in 2008, dropping below 3.0% by 2013–2014, and then rising sharply to over 7.0% in 2016–2017. In recent years, volatility remained high, staying above 5.0%.

Rest of North Island:

This region showed a high volatility level around 2008–2009 (~3.0%), then another peak around 2011–2012 (~3.3%) and a stronger one around 2015 (~4.0%). From 2016 onwards, volatility became more stable and stayed lower, generally within the 1.5%–2.5% range.

Rest of South Island:

The Rest of the South Island experienced large swings. Volatility rose to ~4.0% by 2010, fell sharply to just above 1.0% around 2015, and then climbed again above 3.0% by 2020. This suggests alternating periods of stable and unstable price growth.

National:

National property price growth volatility was lowest around 2008 (~1.5%) and steadily increased to a peak near 5.0% around 2014–2015. Though it declined slightly afterward, volatility stayed relatively elevated (between 2.5% and 4.5%) for the rest of the period.

5.13. Top 5 Regions by Current Average Price-to-Rent CPI Ratio

To get a more current snapshot of the price-to-rent situation, I decided to look at the average Price-to-Rent CPI ratio for each region in the most recent full year of my data. I then filtered for the top 5 regions with the highest ratios, excluding “National” to focus on regional specifics. A higher ratio here could suggest that properties in these regions are relatively more expensive to buy compared to the cost of renting, or perhaps offer different investment dynamics.

# Plot 13 in my R script: Top 5 Regions by Current Average Price-to-Rent CPI Ratio
# First, I find the most recent full year in my data.
current_year_val <- max(full_data$year, na.rm = TRUE)

# Then, I calculate the average price_rent_cpi_ratio for that year, for each region,
# excluding "National" and any NA/non-finite ratios.
# Finally, I pick the top 5 regions with the highest average ratios.
top5_price_rent_regions <- full_data %>%
  filter(year == current_year_val & 
           !is.na(price_rent_cpi_ratio) & 
           is.finite(price_rent_cpi_ratio) & 
           region != "National") %>%
  group_by(region) %>%
  summarise(avg_ratio = mean(price_rent_cpi_ratio, na.rm = TRUE), .groups = "drop") %>%
  arrange(desc(avg_ratio)) %>%
  slice_head(n = 5) # dplyr::slice_head() is good for this

if(nrow(top5_price_rent_regions) > 0){
  plot13 <- ggplot(top5_price_rent_regions, aes(x = reorder(region, avg_ratio), y = avg_ratio, fill = region)) +
    geom_col(show.legend = FALSE, alpha=0.85) + # Bar chart, no legend needed as regions are on axis.
    # Adding the actual ratio value as text on the bars.
    geom_text(aes(label=sprintf("%.1f", avg_ratio)), hjust = -0.2, size=3.5, fontface="bold", color="grey20") +
    # Flipping coordinates to make it a horizontal bar chart - often easier to read region names.
    coord_flip(ylim=c(0, max(top5_price_rent_regions$avg_ratio, na.rm=T)*1.18)) + 
    scale_fill_brewer(palette = "Pastel1") + # Using the palette from your script.
    labs(
      title = paste("Top 5 Regions by Avg. Price-to-Rent CPI Ratio in", current_year_val),
      subtitle = "Excludes 'National'. 'Rest of...' categories use averaged prices for their median_price component.",
      x = "Region", 
      y = "Average Price-to-Rent CPI Ratio"
    ) +
    # Customising the theme slightly for a horizontal bar chart.
    theme(
      panel.grid.major.x = element_line(colour = "grey80"), # Keep horizontal grid lines
      panel.grid.major.y = element_blank() # Remove vertical grid lines for coord_flip
    )
    # This plot uses the global theme_set for overall background.
  print(plot13)
} else { 
  print("Plot 5.13 (Original Plot 13): No data available after filtering for top regions.")
}
Plot 5.13: Top 5 Regions by Average Price-to-Rent CPI Ratio in the Most Recent Year.

Plot 5.13: Top 5 Regions by Average Price-to-Rent CPI Ratio in the Most Recent Year.

Insights

Auckland:
Auckland had the highest average Price-to-Rent CPI ratio in 2020, reaching 596.6. This was well above all other regions, confirming that property prices in Auckland were significantly higher relative to rents, consistent with earlier observations.

Wellington:
Wellington followed with a ratio of 435.4, showing property prices were also high relative to rents but with a large gap compared to Auckland (over 160 points), placing it clearly in a second tier.

Canterbury:
With an average ratio of 332.4, Canterbury ranked third. This indicates a moderate level of price-to-rent imbalance, much lower than both Auckland and Wellington, suggesting greater affordability in comparison.

Rest of North Island:
This region had a ratio of 324.3, nearly the same as Canterbury, pointing to similar price-to-rent dynamics and relatively lower property price pressure.

Rest of South Island:
The lowest among the top five, with a ratio of 322.9, the Rest of South Island had property prices closest to rental levels, indicating the least disparity between the two metrics in 2020 among these regions.

5.15. Regional Price Growth vs. Rent CPI Growth Comparison

For my final EDA plot, I wanted to directly compare the growth rates of property prices and rents in the most recent full year of data I have. I calculated the average annualised monthly growth for both median_price and rent_cpi for each region (excluding “National” to focus on regional dynamics).

Plotting these against each other helps me see: * Which regions had high or low growth in both prices and rents. * Whether price growth and rent growth were similar (points close to the diagonal line where x=y). * Or if one was significantly outpacing the other (points far from the diagonal). The size of the points is mapped to the absolute difference between the two growth rates, highlighting regions where the divergence is largest. I have used ggrepel to label the points with region names so it’s easy to see which is which.

# Plot 15 in my R script: Regional Price & Rent CPI Growth Comparison
# First, I determine the most recent 'full' year in my data to ensure I'm comparing complete annual figures.
recent_full_year_val <- if(length(unique(full_data$year[!is.na(full_data$year)])) > 1) { 
                          # If more than one year of data, take the year before the latest one.
                          max(full_data$year, na.rm=TRUE) -1 
                        } else { 
                          # Otherwise, just use the latest year (might be incomplete but it's all I have).
                          max(full_data$year, na.rm=TRUE) 
                        }

# Calculating average annualised monthly growth for price and rent for each region in that year.
growth_comparison_data <- full_data %>%
  filter(year == recent_full_year_val & 
           !is.na(price_growth_monthly) & is.finite(price_growth_monthly) & # Ensure valid growth numbers
           !is.na(rent_growth_monthly) & is.finite(rent_growth_monthly) &
           region != "National") %>% # Excluding "National" to focus on regions.
  group_by(region) %>%
  # Annualising the average monthly growth rate by multiplying by 12.
  summarise(avg_annual_price_growth = mean(price_growth_monthly*12, na.rm = TRUE), 
            avg_annual_rent_growth = mean(rent_growth_monthly*12, na.rm = TRUE),   
            .groups = "drop") %>%
  drop_na() # Removing any regions where calculation might have failed.

if(nrow(growth_comparison_data) > 0){
  plot15 <- ggplot(growth_comparison_data, 
                   aes(x = avg_annual_rent_growth, y = avg_annual_price_growth)) +
    # Using shape 21 allows for both fill and colour aesthetics.
    # Size of point shows the absolute difference between price and rent growth.
    geom_point(aes(fill = region, size = abs(avg_annual_price_growth - avg_annual_rent_growth)), 
               alpha = 0.8, shape=21, stroke=0.6, color="white", show.legend=FALSE) + 
    # Diagonal line where price growth = rent growth.
    geom_abline(slope = 1, intercept = 0, linetype = "twodash", color = "grey40", linewidth=0.8) +
    # Using ggrepel to label regions without too much overlap.
    ggrepel::geom_text_repel(aes(label = region, colour=region), 
                             size = 3.5, max.overlaps = 20, 
                             segment.alpha = 0.6, fontface="italic",
                             box.padding = 0.5, point.padding = 0.5, 
                             show.legend = FALSE) + # Hiding legend for text labels too.
    scale_x_continuous(labels = percent_format()) + # X-axis as percentage.
    scale_y_continuous(labels = percent_format()) + # Y-axis as percentage.
    scale_fill_brewer(palette = "Set3", name="Region (Fill):") + # Colour palette for point fills.
    scale_colour_brewer(palette="Dark2", name="Region (Label):") + # Different palette for text labels.
    scale_size_continuous(name="Growth Difference (Abs):") + # Legend for point size.
    labs(
      title = paste("Annualised Property Price Growth vs. Rent CPI Growth in", recent_full_year_val),
      subtitle="Bubble size indicates absolute difference between price and rent growth. Excludes 'National'.",
      x = "Average Annualised Monthly Rent CPI Growth", 
      y = "Average Annualised Monthly Property Price Growth"
    ) +
    # Adjusting legend position if needed, or it will use the global theme setting.
    theme(legend.position = "right", legend.box = "vertical") 
  
  # This plot uses the global theme_set.
  print(plot15)
} else { 
  print("Plot 5.15 (Original Plot 15): No data available after filtering for growth comparison.")
}
Plot 5.15: Annualised Property Price Growth vs. Rent CPI Growth in the Most Recent Full Year.

Plot 5.15: Annualised Property Price Growth vs. Rent CPI Growth in the Most Recent Full Year.

Insights

Auckland:
Auckland showed moderate growth: property prices rose ~3–4%, and rents ~2.7%. It’s slightly above the diagonal, indicating a small price–rent growth difference. The small bubble highlights its more balanced market conditions in 2019.

Canterbury:
Canterbury had the lowest growth in both prices (~1.5–2%) and rents (~1.6%). It is positioned just below the diagonal, suggesting rent growth slightly outpaced property prices. The smallest bubble shows minimal divergence between price and rent growth.

Wellington:
Wellington also showed strong property price growth (~12–13%) and rent growth of about 4.1%. It is above the diagonal line, indicating a clear outperformance of prices over rents. The bubble is large, second only to the Rest of South Island.

Rest of North Island:
This region had the highest rent CPI growth (~5%) along with high property price growth (~12.5–13.5%). It lies above the diagonal, meaning prices rose faster than rents. The notable bubble size reflects the significant gap between these growth rates.

Rest of South Island:
This region recorded the highest annualised property price growth in 2019 (~13–14%), with rent CPI growth at around 3.8%. It sits well above the diagonal line, showing that property prices surged far more than rents. It also has the largest bubble, reinforcing this wide gap.

END OF EXPLANATORY DATA ANALSYS

6. Machine Learning: Predictive Modelling for Rent CPI

After exploring the data, my next big goal was to see if I could build models to predict the Rent Consumer Price Index (rent_cpi). I decided to try a few different machine learning algorithms, from a basic Linear Regression to more complex ensemble methods like Random Forest and XGBoost, to see which would perform best. The whole process involved careful data preparation specific to modelling, robust preprocessing using a recipe, cross-validation for tuning and evaluation, and finally, checking performance on an unseen test set.

6.1. Data Preparation for Modelling

Before I could train any models, I needed to prepare a specific dataset for this task. This involved: 1. Selecting the outcome variable (rent_cpi) and the predictor variables (features) I thought would be most relevant from my full_data table. This includes the features I engineered earlier. 2. Handling any missing values (NAs), especially for the target variable and key predictors, to ensure the models get clean data. 3. Explicitly creating numerical date-derived features like year, month, quarter, and half_year from the date column. My plan was to then convert month, quarter, and half_year into factors within the recipe so they could be properly dummy-coded for the models.

# --- 1. Define Feature Set for Modelling ---
# These are the columns I have chosen from 'full_data' to use as potential predictors.
# I'm also keeping 'date' and 'region' for now, as they'll be set to 'ID' roles in the recipe.
ml_feature_cols <- c(
  "median_price",           # Median property price in the region.
  "rented_ratio_national",  # National proportion of dwellings that are rented.
  "owner_ratio_national",   # National proportion of owner-occupied dwellings.
  "free_ratio_national",    # National proportion of dwellings provided free.
  "price_rent_cpi_ratio",   # Ratio of median property price to Rent CPI in the region.
  "log_median_price",       # Log-transformed median property price.
  "price_growth_monthly",   # Monthly growth rate of median property price in the region.
  "rent_growth_monthly"     # Monthly growth rate of Rent CPI in the region.
)

# --- 2. Initial Data Selection ---
# Creating 'ml_data_pre_split' by selecting the outcome, IDs, and the features defined above.
ml_data_pre_split <- full_data %>%
  select(rent_cpi, date, region, all_of(ml_feature_cols)) %>%
  drop_na(rent_cpi) # Crucial: Rows with a missing target variable can't be used.

# --- 3. Sanity Checks & Further NA Handling for Predictors ---
# This block ensures the data going into the ML pipeline is viable.
if (nrow(ml_data_pre_split) > 0) {
  print("--- ML Data Sanity Checks ---")
  print(paste("Rows before dropping NAs in predictors:", nrow(ml_data_pre_split)))
  
  # Further dropping rows if any of these specific key predictors have NAs.
  # While the recipe will handle some imputation later, this initial drop ensures that
  # rows with missing data in critical analytical features are removed.
  ml_data_final <- ml_data_pre_split %>%
    drop_na(any_of(c("median_price", "price_rent_cpi_ratio", 
                     "rented_ratio_national", "price_growth_monthly", "rent_growth_monthly")))
  
  print(paste("Rows after dropping NAs in key predictors:", nrow(ml_data_final)))
  
  # Checking if the target variable has sufficient variance after all this filtering.
  # If variance is very low or zero, modelling might not be meaningful.
  if (nrow(ml_data_final) > 0 && var(ml_data_final$rent_cpi, na.rm = TRUE) < 1e-6) {
    print("Warning: Target 'rent_cpi' has near-zero variance after filtering. ML might not be meaningful.")
    ml_data_final <- data.frame() # Creating an empty dataframe to gracefully skip ML if target variance is an issue.
  }
} else {
  ml_data_final <- data.frame() # Starting with an empty dataframe if pre_split was already empty.
}
## [1] "--- ML Data Sanity Checks ---"
## [1] "Rows before dropping NAs in predictors: 972"
## [1] "Rows after dropping NAs in key predictors: 972"
# --- 4. Add Date-Derived Features Explicitly ---
# Creating numerical year, month, quarter, and half_year features from the 'date' column.
# My recipe will later convert 'month', 'quarter', and 'half_year' into factors for modelling.
if (nrow(ml_data_final) > 0) {
  ml_data_final <- ml_data_final %>%
    mutate(
      year = lubridate::year(date),
      month = lubridate::month(date), # Will be converted to factor in recipe
      quarter = lubridate::quarter(date), # Will be converted to factor in recipe
      half_year = ifelse(month %in% 1:6, 1, 2) # Will be converted to factor in recipe
    )
}

# Quick check of the data that will be used for modelling.
print("Sample of ml_data_final (first few rows, selected columns):")
## [1] "Sample of ml_data_final (first few rows, selected columns):"
if(nrow(ml_data_final) > 0) print(head(select(ml_data_final, date, region, rent_cpi, median_price, year, month)))
## # A tibble: 6 × 6
##   date       region               rent_cpi median_price  year month
##   <date>     <chr>                   <dbl>        <dbl> <dbl> <dbl>
## 1 2007-01-01 Auckland                 1007      419500   2007     1
## 2 2007-01-01 Wellington               1010      350000   2007     1
## 3 2007-01-01 Rest of North Island     1003      280571.  2007     1
## 4 2007-01-01 Canterbury               1023      294800   2007     1
## 5 2007-01-01 Rest of South Island     1052      257333.  2007     1
## 6 2007-01-01 National                 1025      325500   2007     1

This ml_data_final table is what I will use for splitting into training and testing sets in the next step. Making sure it’s clean and has the right initial features is pretty important before diving into the recipe and model training.

6.2. Data Splitting and Preprocessing Recipe

With my ml_data_final table ready, the next crucial step before training any models was to split it into a training set and a testing set. I decided on a standard 80/20 split, meaning 80% of the data would be used to train the models, and the remaining 20% would be held back as unseen data to test how well the models generalise. To make sure both sets were representative, I used stratified sampling based on the outcome variable, rent_cpi. This helps ensure that the distribution of rent_cpi values is similar in both my training and testing sets. Setting a seed (set.seed(123)) means this split will be the same every time I run the code, which is great for reproducibility.

Once I had my training data, I defined a preprocessing recipe using the recipes package (part of tidymodels). This is a really powerful way to outline all the data preparation steps that need to happen. The beauty of a recipe is that these steps are defined once and then consistently applied, especially during cross-validation and when processing the test set later on.

My recipe for this project includes several steps: * Role Assignment: I first told the recipe that region and the original date column are just ‘ID’ variables, so they won’t be used as predictors by default. * Date Feature Conversion: I used step_mutate to convert the numeric month, quarter, and half_year columns (that I created earlier from the date) into proper factors with meaningful labels (like “Jan”, “Feb” for months; “Q1”, “Q2” for quarters). This is important so they can be correctly turned into dummy variables. * Handling New Factor Levels: step_novel is included to deal with any new factor levels that might pop up in new data (like in a CV fold or the test set) but weren’t in the original training data used to prep the recipe. * Zero Variance Predictors: step_zv removes any predictors that have only one unique value (i.e., zero variance), as these offer no predictive information. * Imputation: step_impute_median fills in any remaining missing values in numeric columns using the median calculated from the training set. This ensures the models don’t break due to NAs. * Dummy Variables: step_dummy converts all my nominal (factor) predictors into numeric dummy variables. I used one_hot = FALSE, which creates k-1 dummies for a k-level factor. This is generally preferred for models like linear regression that include an intercept, as it avoids perfect multicollinearity from the dummy set. * Normalisation: step_normalize scales all numeric predictors to have a mean of 0 and a standard deviation of 1. This can help some algorithms perform better. * Correlation Filter: Finally, step_corr removes one of a pair of numeric predictors if their absolute correlation is very high (I set the threshold at 0.9), which helps reduce multicollinearity.

After defining all these steps, I prepped the recipe using the training data. This ‘prepping’ step estimates any parameters needed from the training data (like medians for imputation, means/standard deviations for normalisation, etc.). This prepped_recipe_for_models object is what I’ll use to apply these transformations consistently.

# Proceeding only if I have enough data after the previous preparation steps.
if (nrow(ml_data_final) > 100) { 
  
  # --- 5a. Data Splitting ---
  # Setting a seed for random number generation to ensure my data split is reproducible every time I run the script.
  set.seed(123) 
  # Splitting the data: 80% for training, 20% for testing.
  # Stratifying by 'rent_cpi' to maintain similar outcome distributions in both sets, which is good practice.
  split <- initial_split(ml_data_final, prop = 0.8, strata = rent_cpi)
  train_data <- training(split)
  test_data <- testing(split)
  
  print(paste("Training data rows:", nrow(train_data)))
  print(paste("Test data rows:", nrow(test_data)))
  
  # --- 5b. Define Preprocessing Recipe ---
  # The recipe outlines a series of steps to prepare data for modelling.
  # The formula 'rent_cpi ~ .' means I'm predicting 'rent_cpi' using all other columns 
  # in 'train_data' as predictors by default, unless their roles are changed.
  rec <- recipe(rent_cpi ~ ., data = train_data) %>%
    # 'region' and 'date' are assigned 'ID' roles: they won't be used as predictors by default
    # but are kept in the data for potential reference or specific time-series handling later.
    update_role(region, new_role = "ID") %>% 
    update_role(date, new_role = "ID") %>%   # Original date column is now an ID.
    
    # Converting my manually created numeric month, quarter, and half_year into factors
    # with proper labels. This is important before creating dummy variables.
    step_mutate( 
        month = factor(month, levels = 1:12, labels = month.abb, ordered = FALSE), # e.g., 1 -> "Jan"
        quarter = factor(quarter, levels = 1:4, labels = paste0("Q", 1:4), ordered = FALSE), # e.g., 1 -> "Q1"
        half_year = factor(half_year, levels = 1:2, labels = paste0("H", 1:2), ordered = FALSE) # e.g., 1 -> "H1"
    ) %>%
    # Handles any new factor levels that might appear in test/CV data but weren't 
    # in the training set when the recipe was prepped. It assigns them to a special 'new' level.
    step_novel(all_nominal_predictors(), all_factor_predictors(), -all_outcomes(), -has_role("ID")) %>% 
    # Removes predictors that have only one unique value (zero variance), as they offer no info.
    step_zv(all_predictors()) %>% 
    # Imputes missing values in numeric columns using the median calculated from the training set.
    step_impute_median(all_numeric_predictors()) %>% 
    # Creates dummy (binary 0/1) variables from all nominal (factor) predictors.
    # `one_hot = FALSE` means k-1 dummies are created for a k-level factor, good for models with an intercept.
    step_dummy(all_nominal_predictors(), -all_outcomes(), -has_role("ID"), one_hot = FALSE) %>% 
    # Normalises (centres to mean 0 and scales to std dev 1) all numeric predictors.
    # This helps algorithms that are sensitive to feature scales.
    step_normalize(all_numeric_predictors()) %>%
    # Removes one of a pair of numeric predictors if their absolute correlation is above 0.9.
    # This helps reduce multicollinearity.
    step_corr(all_numeric_predictors(), threshold = 0.9)
  
  print("Recipe defined. Ready for model training.")
  
  # Prepping the recipe once on the training data. 
  # This 'prepped_recipe_for_models' object now contains all the information needed 
  # to apply these exact transformations to any new data (like CV folds or the test set).
  prepped_recipe_for_models <- prep(rec, training = train_data)

  # I can inspect the prepped recipe to see what it learned, e.g., which columns were dummied or removed.
  # summary(prepped_recipe_for_models) 
  # tidy(prepped_recipe_for_models, number = x) # for step x

} else {
  # This part ensures that if there wasn't enough data to begin with, 
  # the script notes it and later model training steps will be skipped.
  print("Not enough data for ML modelling after initial preparation. Skipping recipe creation and model training.")
  # Setting these to NULL so later chunks can check if models were trained.
  train_data <- test_data <- NULL # No data to split
  rec <- prepped_recipe_for_models <- NULL 
  model_lm <- model_lasso <- model_rf <- model_xgb <- NULL
  results <- NULL 
}
## [1] "Training data rows: 776"
## [1] "Test data rows: 196"
## [1] "Recipe defined. Ready for model training."

With my data split and a solid preprocessing recipe in place and prepped, I am now all set to actually train and tune the different machine learning models.

6.3. Model Training and Tuning

With my training data prepared and my preprocessing recipe defined and prepped, I was ready to train the actual machine learning models. My strategy here was to use 10-fold cross-validation for robust performance estimation and for tuning the hyperparameters of the more complex models. I set this up using caret’s trainControl() function. I also enabled savePredictions = "final" which is useful if I want to look at the out-of-fold predictions later, and allowParallel = TRUE which can speed up training if I have a parallel backend like doParallel registered (though for this script, I have not explicitly registered one, caret might still use available cores on some systems).

I decided to train four types of models to compare their performance: 1. Linear Regression (lm): A good, simple baseline model. 2. Lasso Regression (glmnet): This is a regularised linear model that can also perform feature selection by shrinking some coefficients exactly to zero. I tuned its lambda (regularisation strength) parameter. 3. Random Forest (rf): A powerful ensemble method based on decision trees. I fixed the number of trees (ntree) to 50 (based on my earlier experimentation where more trees did not offer significant gains for the added computation) and tuned the mtry parameter (number of variables randomly sampled at each split) using a dynamically generated grid. 4. XGBoost (xgbTree): Another very powerful gradient boosting ensemble method. I set up a grid (xgb_grid) to tune several of its key hyperparameters like the number of rounds, tree depth, and learning rate.

For all models, I used “RMSE” (Root Mean Squared Error) as the primary metric to optimise during tuning, as it is a common and interpretable measure of prediction error for regression tasks.

The R code chunk below runs the training for all these models. This can take a bit of time, especially for Random Forest and XGBoost with hyperparameter tuning. So, in this R Markdown file, I have set the chunk options results='hide' to prevent all the verbose training logs from caret from cluttering the final report, and cache=TRUE. The cache=TRUE option is really useful because R Markdown will save the results of this chunk (the trained models) and will only re-run the training if the code in this chunk or the data it depends on actually changes. This saves a lot of time when I re-knit the document. I will print out summaries of the models and their performance in the subsequent sections.

# This chunk will run the model training. 
# Output logs are hidden in the R Markdown report due to 'results='hide''.
# 'cache=TRUE' will save computation time on subsequent knits if code/data unchanged.

# Checking if I have enough data and if the recipe was prepped before trying to train models.
# This 'if' block should ideally not be strictly necessary here if the Rmd is knitted top-to-bottom,
# as the objects would exist from the previous chunk. However, it adds robustness if running chunks interactively.
if (exists("ml_data_final") && nrow(ml_data_final) > 100 && 
    exists("prepped_recipe_for_models") && !is.null(prepped_recipe_for_models) &&
    exists("train_data") && !is.null(train_data) ) {
  
  # Defining my cross-validation strategy using caret's trainControl.
  ctrl <- trainControl(
    method = "cv",           # Using k-fold Cross-validation.
    number = 10,             # Using 10 folds.
    savePredictions = "final", # Saves out-of-fold predictions for each resample (useful for some analyses).
    allowParallel = TRUE,    # Allows parallel processing if a backend like 'doParallel' is registered.
    search = "grid"          # Using grid search when a tuneGrid is supplied.
  )
  
  # --- Linear Regression (LM) ---
  # My first model, a standard linear regression. It serves as a good baseline.
  # Caret uses the 'prepped_recipe_for_models' to preprocess data automatically during training and CV.
  print("Training Linear Model...")
  model_lm <- train(prepped_recipe_for_models, # Using the prepped recipe
                    data = train_data, 
                    method = "lm", 
                    trControl = ctrl, 
                    metric = "RMSE") # Optimising for RMSE.
  
  # --- Lasso Regression (glmnet) ---
  # Lasso performs L1 regularisation, which can also do feature selection.
  # I am only tuning 'lambda' here (alpha=1 for Lasso).
  print("Training Lasso Model...")
  lasso_grid <- expand.grid(alpha = 1, lambda = 10^seq(-4, -1, length.out = 10)) # Defining a grid of 10 lambda values.
  model_lasso <- train(prepped_recipe_for_models, 
                       data = train_data, 
                       method = "glmnet",
                       tuneGrid = lasso_grid,
                       trControl = ctrl, 
                       metric = "RMSE")
  
  # --- Random Forest (rf) ---
  # An ensemble method that is usually quite powerful. 
  # I have set ntree=50 based on my earlier experiments.
  # Caret will tune 'mtry' (number of variables randomly sampled at each split).
  print("Training Random Forest Model...")
  
  # Dynamically creating a sensible mtry grid based on the actual number of predictors 
  # *after* the recipe has been preprocessed on the training data. This is important because
  # steps like dummy variable creation or correlation removal can change the number of predictors.
  temp_juiced_train_rf <- juice(prepped_recipe_for_models) # Getting the processed training data.
  # Number of predictors is total columns minus 1 (for the outcome variable 'rent_cpi').
  num_actual_predictors_rf <- ncol(temp_juiced_train_rf) - 1 
  
  # Defining a set of mtry candidates. These are common heuristics.
  mtry_candidates_rf <- unique(floor(c(num_actual_predictors_rf/3, sqrt(num_actual_predictors_rf), 
                                   num_actual_predictors_rf*0.2, num_actual_predictors_rf*0.5)))
  # Ensuring candidates are valid (greater than 0 and not more than total predictors).
  mtry_candidates_rf <- mtry_candidates_rf[mtry_candidates_rf > 0 & mtry_candidates_rf <= num_actual_predictors_rf]
  if(length(mtry_candidates_rf) == 0) mtry_candidates_rf <- max(1, floor(num_actual_predictors_rf/3)) # Fallback
  if(any(is.na(mtry_candidates_rf)) || length(mtry_candidates_rf) == 0) mtry_candidates_rf <- 2 # Absolute fallback
  
  rf_tuning_grid_final <- expand.grid(.mtry = unique(sort(mtry_candidates_rf[mtry_candidates_rf>0])))
  # print("Random Forest mtry tuning grid to be used:"); print(rf_tuning_grid_final) # Can uncomment to see the grid
  
  model_rf <- train(prepped_recipe_for_models, 
                    data = train_data, 
                    method = "rf", 
                    ntree = 50, # Fixed number of trees.
                    tuneGrid = rf_tuning_grid_final, # Using my defined mtry grid.
                    trControl = ctrl, 
                    metric = "RMSE",
                    importance = TRUE) # Calculating variable importance measures.
  
  # --- XGBoost (xgbTree) ---
  # Another very powerful gradient boosting ensemble. I am tuning several key parameters using a grid search.
  # This can be quite computationally intensive, so the grid size is a balance.
  print("Training XGBoost Model...")
  xgb_grid <- expand.grid(
    nrounds = c(100, 300, 500),     # Number of boosting rounds (trees).
    max_depth = c(3, 6, 9),         # Maximum depth of each tree.
    eta = c(0.01, 0.1, 0.3),        # Learning rate (shrinkage).
    gamma = 0,                      # Minimum loss reduction for a split (kept fixed at 0 for this grid).
    colsample_bytree = 0.8,         # Fraction of columns sampled per tree (kept fixed).
    min_child_weight = 1,           # Minimum sum of instance weights in a child node (kept fixed).
    subsample = 0.8                 # Fraction of training data sampled per tree (kept fixed).
  ) # This results in 3*3*3 = 27 combinations for caret to test.
  
  model_xgb <- train(prepped_recipe_for_models, 
                     data = train_data, 
                     method = "xgbTree",
                     tuneGrid = xgb_grid,
                     trControl = ctrl, 
                     metric = "RMSE",
                     verbosity = 0) # Suppressing XGBoost's own messages during training.
} else {
  print("Skipping model training as pre-requisites (sufficient data or prepped recipe) are not met.")
}

I have set results='hide' for the chunk above because the training process can print a lot of information to the console, which would make the report very long. The cache=TRUE option is great because it means R Markdown will save the results of this chunk (the trained models) and will only re-run the training if the code in this chunk or the data it depends on actually changes. This saves me a lot of time when I re-knit the document.

I will print out summaries of the models and their performance metrics in the next sections of the report.

6.4. Model Performance Comparison (Cross-Validation)

Now that I have trained my different models (Linear Regression, Lasso, Random Forest, and XGBoost), the next step is to compare how they performed during the 10-fold cross-validation process. The caret package makes this pretty straightforward with the resamples() function, which collects all the performance metrics (like RMSE, R-squared, and MAE) from each fold for each model.

First, I will print a summary table of these resampled metrics. This gives me a statistical overview (min, median, mean, max) of how each model did across the different folds. Then, I will use dotplot() to visualise these distributions, which makes it easier to quickly compare the models, especially looking at their median performance and the spread of their results. I am particularly interested in RMSE (lower is better) and R-squared (higher is better).

# This chunk assumes that 'model_lm', 'model_lasso', 'model_rf', and 'model_xgb' 
# were successfully trained in the previous chunk (ml_train_models_chunk).

# Checking if the model objects exist and are not NULL before proceeding.
if (exists("model_lm") && !is.null(model_lm) && 
    exists("model_lasso") && !is.null(model_lasso) && 
    exists("model_rf") && !is.null(model_rf) && 
    exists("model_xgb") && !is.null(model_xgb)) { 
      
  # Collecting all trained model objects into a list for comparison using resamples().
  model_list_for_resamples <- list(LM = model_lm, Lasso = model_lasso, RF = model_rf, XGB = model_xgb)
  # Getting the resampling results.
  results <- resamples(model_list_for_resamples)
  
  print("--- Resamples Summary (Cross-Validation Metrics) ---")
  # This summary gives me stats like Min, Median, Mean, Max for each metric (RMSE, Rsquared, MAE) 
  # across the 10 cross-validation folds for each model.
  print(summary(results)) 
  
  # Now, I will create dot plots to visualise these comparisons.
  # It is important to check if the 'results' object and its 'values' component are valid before plotting.
  if (!is.null(results) && !is.null(results$values) && nrow(results$values) > 0 && !any(is.na(results$values))) {
    
    print("--- Generating Model Comparison Dot Plots ---")
    
    # For RMSE dotplot
    # Using the 'main' argument, passed as a list, to set the title for these lattice-based plots.
    # Calling dotplot() directly as it's an S3 method.
    tryCatch({
      plot_cv_rmse <- dotplot(results, 
                              metric = "RMSE", 
                              main = list(label = "Model Comparison (Cross-Validation RMSE)", cex = 1.1, col = "grey10")
                              )
      if (!is.null(plot_cv_rmse)) {
        print(plot_cv_rmse)
      } else {
        print("RMSE dotplot object was NULL or failed to generate.")
      }
    }, error = function(e_rmse_plot) {
      print(paste("Error generating RMSE dotplot:", e_rmse_plot$message))
      print("As a test, you could try running just: dotplot(results, metric='RMSE') in your console.")
    })
    
    # For Rsquared dotplot
    tryCatch({
      plot_cv_rsquared <- dotplot(results, 
                                  metric = "Rsquared", 
                                  main = list(label = "Model Comparison (Cross-Validation R-squared)", cex = 1.1, col = "grey10")
                                  )
      if (!is.null(plot_cv_rsquared)) {
        print(plot_cv_rsquared)
      } else {
        print("Rsquared dotplot object was NULL or failed to generate.")
      }
    }, error = function(e_rsq_plot) {
      print(paste("Error generating Rsquared dotplot:", e_rsq_plot$message))
      print("As a test, you could try running just: dotplot(results, metric='Rsquared') in your console.")
    })
    
  } else { 
    # This message appears if there was an issue with the 'results' object itself.
    print("No valid metric values for resamples plot, or 'results' object is NULL/NA.")
    if(exists("results") && !is.null(results) && !is.null(results$values) && nrow(results$values) > 0 && any(is.na(results$values))){
        # This specific check for NAs within results$values is important.
        print("NA/NaN values were detected in resamples metrics earlier. This might be why plots cannot be generated.")
        print("ACTION: Run warnings() in your R console to investigate specific warnings generated during training.")
    }
  }
} else {
  # This message appears if the models themselves were not trained in the previous step.
  print("One or more ML models were not trained, so skipping results summary and comparison plots.")
}
## [1] "--- Resamples Summary (Cross-Validation Metrics) ---"
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LM, Lasso, RF, XGB 
## Number of resamples: 10 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LM    56.84831 66.88404 71.84790 70.12038 73.85008 78.75437    0
## Lasso 62.34046 66.55180 70.11223 69.94094 72.04063 80.57571    0
## RF    17.45075 21.17442 22.83042 22.78585 25.28783 26.62823    0
## XGB   16.26965 17.15206 17.87772 19.11131 20.26289 24.59136    0
## 
## RMSE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## LM    71.78156 79.95613 86.00672 84.34394 88.08922 96.33101    0
## Lasso 73.65286 82.07546 84.18448 84.17320 86.07776 96.60066    0
## RF    26.27526 30.04312 34.93598 33.81747 37.48122 39.86072    0
## XGB   22.83075 25.68027 27.64934 28.64640 31.36303 35.50130    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LM    0.6539824 0.6958971 0.7172894 0.7158083 0.7366345 0.7724865    0
## Lasso 0.6660041 0.7026623 0.7141840 0.7153712 0.7232549 0.7655254    0
## RF    0.9372410 0.9467224 0.9538050 0.9550157 0.9661711 0.9712157    0
## XGB   0.9517004 0.9622868 0.9709383 0.9677934 0.9741979 0.9782992    0
## 
## [1] "--- Generating Model Comparison Dot Plots ---"

Insights

1. Summary of Cross-Validation Metrics (from Console Output)

The console output provides a statistical summary (Minimum, 1st Quartile, Median, Mean, 3rd Quartile, Maximum) for MAE, RMSE, and R-squared across the 10 resamples (folds) for each model.

a. Mean Absolute Error (MAE):

  • LM (Linear Model): Median MAE is 71.85; Mean MAE is 70.12.
  • Lasso: Median MAE is 70.11; Mean MAE is 69.94. (Slightly better than LM on average).
  • RF (Random Forest): Median MAE is 22.83; Mean MAE is 22.79. (Significantly lower than LM and Lasso).
  • XGB (XGBoost): Median MAE is 17.88; Mean MAE is 19.11. (The lowest MAE, indicating the best performance on this metric).

Insight: XGBoost has the lowest MAE, followed closely by Random Forest. Both tree-based models (RF and XGB) perform substantially better than the linear models (LM and Lasso) in terms of average prediction error.

b. Root Mean Squared Error (RMSE):

  • LM: Median RMSE is 86.01; Mean RMSE is 84.34.
  • Lasso: Median RMSE is 84.18; Mean RMSE is 84.17. (Again, slightly better than LM).
  • RF: Median RMSE is 34.94; Mean RMSE is 33.82. (Much lower than LM and Lasso).
  • XGB: Median RMSE is 27.65; Mean RMSE is 28.65. (The lowest RMSE, indicating the best performance).

Insight: Similar to MAE, XGBoost achieves the lowest RMSE, followed by Random Forest. RMSE penalizes larger errors more heavily than MAE, and XGBoost’s superior performance here suggests it makes fewer large errors compared to RF, and significantly fewer than LM and Lasso.

c. R-squared (R²):

  • LM: Median R² is 0.717; Mean R² is 0.716.
  • Lasso: Median R² is 0.714; Mean R² is 0.715. (Very similar to LM).
  • RF: Median R² is 0.954; Mean R² is 0.955. (Significantly higher, indicating a much better fit).
  • XGB: Median R² is 0.971; Mean R² is 0.968. (The highest R², indicating the best proportion of variance explained).

Insight: XGBoost has the highest R-squared, meaning it explains the most variance in the target variable, closely followed by Random Forest. Both RF and XGB explain over 95% of the variance on average, while LM and Lasso explain around 71-72%.


2. Analysis of Model Comparison Dot Plots

The dot plots visually confirm the findings from the summary table and provide insights into the consistency of model performance across the cross-validation folds. Each point on the dot plot represents the metric value for one fold, and the larger dot is typically the median.

a. RMSE Dot Plot:

  • LM and Lasso: These models have RMSE values clustered at the higher end of the scale (mostly between ~70 and ~97). Their distributions are relatively wide, indicating some variability in performance across folds, but they are clearly separated from and perform worse than RF and XGB.
  • RF: The RMSE values for RF are much lower, centered around a median of ~35. The spread of results is tighter than for LM/Lasso, but wider than for XGB.
  • XGB: XGBoost shows the lowest RMSE values, with a median around ~28. Its distribution is also relatively tight, suggesting consistent performance across folds and superiority over the other models.

Insight: The dot plot clearly shows a distinct performance advantage for RF and XGBoost over LM and Lasso in terms of RMSE. XGBoost is visually the best performer with the lowest and most tightly clustered RMSE values.

b. R-squared Dot Plot:

  • LM and Lasso: Their R-squared values are clustered at the lower end (mostly between ~0.65 and ~0.77). They perform similarly to each other.
  • RF: RF’s R-squared values are much higher, centered around a median of ~0.95. The distribution is relatively tight.
  • XGB: XGBoost has the highest R-squared values, with a median around ~0.97. Its distribution is also tight, indicating consistently high explanatory power.

Insight: The dot plot confirms that RF and XGBoost provide a much better fit to the data than LM and Lasso. XGBoost again appears slightly superior to RF, with its R-squared values consistently at the top end.


Overall Conclusion:

Based on all three metrics (MAE, RMSE, and R-squared) from the 10-fold cross-validation:

XGBoost (XGB) consistently emerges as the best-performing model. It has the lowest median and mean MAE and RMSE, and the highest median and mean R-squared. Its performance is also relatively consistent across the cross-validation folds.

Random Forest (RF) is a strong second-best performer, significantly outperforming the linear models and achieving metrics close to XGBoost.

Linear Model (LM) and Lasso Regression perform substantially worse than RF and XGBoost on all metrics. Their error rates are much higher, and they explain significantly less variance in the data. Lasso offers a marginal improvement over the standard Linear Model in some cases but does not bridge the gap to the tree-based ensemble methods.

Therefore, for predictive accuracy and goodness-of-fit on this particular dataset and task, XGBoost is the preferred model, followed closely by Random Forest.

6.5. Evaluation on Test Set

Cross-validation gave me a good idea of how my models might perform and helped me tune their hyperparameters. However, the true test of a model’s ability to generalise to new, unseen data is its performance on the test set, which I held back and did not use at all during training or tuning.

For this final evaluation, I will take my trained models (Linear Regression, Lasso, Random Forest, and XGBoost, all using their tuned hyperparameters where applicable) and use them to make predictions on the test_data. I will then calculate the Root Mean Squared Error (RMSE) for each model to compare their predictive accuracy on this hold-out data.

I will also print out some key details for each final model, like the summary for the linear model and the best tuning parameters found for Lasso, Random Forest, and XGBoost. This helps document the final model configurations.

# This chunk assumes 'model_list_for_resamples' (containing all trained models)
# and 'test_data' are available from the previous chunks.

# Checking if I have the list of models and the test data available.
if (exists("model_list_for_resamples") && 
    !is.null(model_list_for_resamples) && 
    length(model_list_for_resamples) > 0 && 
    exists("test_data") && !is.null(test_data) && nrow(test_data) > 0) {
      
  print("--- RMSE on Test Set ---")
  # Calculating RMSE for each model on the held-out test set.
  # I am using yardstick::rmse_vec for this, which is part of the tidymodels ecosystem.
  rmse_values <- sapply(model_list_for_resamples, function(mod) { 
    # Defensive check for a valid model object.
    if(is.null(mod) || !("train" %in% class(mod))) {
      print(paste("Warning: A model object in list is NULL or not a caret 'train' object for RMSE calculation."))
      return(NA_real_) # Returning NA if the model object is not as expected.
    }
    
    tryCatch({ # Adding tryCatch in case prediction fails for a specific model on the test set.
        predictions <- predict(mod, newdata = test_data) # Making predictions on the test set.
        actuals <- test_data$rent_cpi                 # The true values from the test set.
        
        # Basic checks before calculating RMSE to avoid errors.
        if (length(predictions) != length(actuals)) {
          print(paste("Prediction length mismatch for a model. Preds:", length(predictions), "Actuals:", length(actuals)))
          return(NA_real_)
        }
        if (all(is.na(predictions))) {
          print("All predictions are NA for a model on the test set.")
          return(NA_real_) 
        }
        # If there are some NAs in predictions or actuals (though recipe should handle NAs in predictors),
        # rmse_vec will calculate RMSE on complete pairs.
        if (any(is.na(predictions)) || any(is.na(actuals)) ) {
           print("NAs found in predictions or actuals for test set RMSE calculation. RMSE will be calculated on complete pairs.")
        }
        
        return(yardstick::rmse_vec(truth = actuals, estimate = predictions))
    }, error = function(e) { 
        print(paste("Error predicting with a model on test set:", e$message)); 
        return(NA_real_)
    })
  })
  print(rmse_values) # Displaying the test set RMSE for each model.
  
  # --- Printing Key Model Summaries / Best Tunes ---
  # This helps to document the final configuration of each model.
  if(exists("model_lm") && !is.null(model_lm$finalModel)) {
      print("--- Final Linear Model Details (trained on full training data) ---")
      # The summary for lm shows coefficients, R-squared for the training data, etc.
      # It also shows which coefficients were not defined due to singularities (multicollinearity).
      print(summary(model_lm$finalModel)) 
  }
  if(exists("model_lasso") && !is.null(model_lasso$bestTune)) {
      print(paste("--- Best tuning parameters for Lasso: lambda =", round(model_lasso$bestTune$lambda, 5)))
  }
  if(exists("model_rf") && !is.null(model_rf$bestTune)) {
      print(paste("--- Best tuning parameters for Random Forest: mtry =", model_rf$bestTune$mtry))
      # Variable importance can be very insightful for Random Forest.
      # I can uncomment this to plot it if I want to see it in the report.
      # print("Random Forest Variable Importance (Top 10):")
      # rf_var_imp <- varImp(model_rf, scale = FALSE) 
      # print(plot(rf_var_imp, top = 10, main = "Top 10 Important Variables - Random Forest"))
  }
  if(exists("model_xgb") && !is.null(model_xgb$bestTune)) {
      print("--- Best tuning parameters for XGBoost: ---")
      print(model_xgb$bestTune)
      # XGBoost also provides variable importance.
      # print("XGBoost Variable Importance (Top 10):")
      # xgb_var_imp <- varImp(model_xgb, scale = FALSE)
      # print(plot(xgb_var_imp, top = 10, main = "Top 10 Important Variables - XGBoost"))
  }
  
} else {
  print("Model list for resamples not available, or test_data is empty/NULL. Skipping test set evaluation.")
}
## [1] "--- RMSE on Test Set ---"
##       LM    Lasso       RF      XGB 
## 76.54330 76.36756 28.53961 26.77484 
## [1] "--- Final Linear Model Details (trained on full training data) ---"
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -143.734  -68.704   -4.457   53.174  264.002 
## 
## Coefficients: (7 not defined because of singularities)
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1244.0284     3.0117 413.067  < 2e-16 ***
## owner_ratio_national -118.0231     3.0932 -38.156  < 2e-16 ***
## price_rent_cpi_ratio   42.8195     3.1068  13.782  < 2e-16 ***
## price_growth_monthly   -2.2454     3.3803  -0.664    0.507    
## rent_growth_monthly    13.9697     3.1573   4.425 1.11e-05 ***
## month_Feb               4.3937     4.3275   1.015    0.310    
## month_Mar               5.2725     4.4433   1.187    0.236    
## month_Apr               3.5661     4.2618   0.837    0.403    
## month_May               2.7986     4.2382   0.660    0.509    
## month_Jun               3.5484     4.2178   0.841    0.400    
## month_Jul              -2.6366     4.2421  -0.622    0.534    
## month_Aug              -1.6050     4.2167  -0.381    0.704    
## month_Sep              -2.0301     4.1152  -0.493    0.622    
## month_Oct              -1.0603     4.2323  -0.251    0.802    
## month_Nov               0.2852     4.2663   0.067    0.947    
## month_Dec              -0.3049     4.2419  -0.072    0.943    
## month_new                   NA         NA      NA       NA    
## quarter_Q2                  NA         NA      NA       NA    
## quarter_Q3                  NA         NA      NA       NA    
## quarter_Q4                  NA         NA      NA       NA    
## quarter_new                 NA         NA      NA       NA    
## half_year_H2                NA         NA      NA       NA    
## half_year_new               NA         NA      NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 83.9 on 760 degrees of freedom
## Multiple R-squared:  0.7228, Adjusted R-squared:  0.7173 
## F-statistic: 132.1 on 15 and 760 DF,  p-value: < 2.2e-16
## 
## [1] "--- Best tuning parameters for Lasso: lambda = 0.1"
## [1] "--- Best tuning parameters for Random Forest: mtry = 12"
## [1] "--- Best tuning parameters for XGBoost: ---"
##    nrounds max_depth eta gamma colsample_bytree min_child_weight subsample
## 18     500         9 0.1     0              0.8                1       0.8

Insights

1. RMSE on Test Set

The R code calculates the RMSE for each model’s predictions on the test_data. The results are as follows:

  • LM (Linear Model): RMSE = 76.54
  • Lasso: RMSE = 76.37
  • RF (Random Forest): RMSE = 28.54
  • XGB (XGBoost): RMSE = 26.77

Insight:

Consistent with the cross-validation results, the tree-based ensemble models, XGBoost and Random Forest, significantly outperform the linear models (LM and Lasso) on the test set.
XGBoost achieves the lowest RMSE (26.77), indicating it had the smallest prediction errors on average on the unseen test data.
Random Forest is a close second with an RMSE of 28.54.
Lasso Regression (RMSE 76.37) performs marginally better than the standard Linear Model (RMSE 76.54), but both have substantially higher errors than RF and XGBoost.
This confirms that the superior performance of XGBoost and Random Forest observed during cross-validation translates well to unseen data, suggesting good generalization.

2. Final Linear Model (LM) Details

The summary output for the finalModel of the Linear Regression (trained on the full training dataset) provides several details:

Coefficients:

  • The (Intercept) is 1244.03.
  • owner_ratio_national has a strong negative coefficient (-118.02) and is highly statistically significant (p < 2e-16), suggesting that as the national owner-occupied ratio decreases, the predicted Rent CPI increases.
  • price_rent_cpi_ratio has a positive coefficient (42.82) and is highly statistically significant (p < 2e-16), indicating that a higher price-to-rent CPI ratio is associated with a higher predicted Rent CPI.
  • rent_growth_monthly also has a positive coefficient (13.97) and is highly significant (p = 1.11e-05).
  • price_growth_monthly has a small negative coefficient (-2.25) but is not statistically significant (p = 0.507).
  • Most of the month_* dummy variables are not statistically significant, suggesting limited monthly seasonality after accounting for other factors.
  • Singularities: 7 coefficients (related to month_new, quarter_*, half_year_*) were not defined due to multicollinearity (perfect correlation between predictors), which is common when including multiple related time-based dummy variables.

Goodness-of-Fit (on training data):

  • Multiple R-squared: 0.7228 (The model explains about 72.3% of the variance in the Rent CPI on the training data).
  • Adjusted R-squared: 0.7173 (Adjusted for the number of predictors).
  • F-statistic: 132.1 on 15 and 760 DF, with a p-value < 2.2e-16, indicating that the overall model is statistically significant (at least some predictors are related to the outcome).

Insight:

The linear model identifies owner_ratio_national, price_rent_cpi_ratio, and rent_growth_monthly as significant predictors of Rent CPI in the training data. However, its overall predictive performance (R-squared on training data and RMSE on test data) is much weaker than the ensemble methods.

3. Best Tuning Parameters for Tuned Models

The output also provides the optimal hyperparameters found during the cross-validation tuning process for Lasso, Random Forest, and XGBoost:

Lasso:

  • Best lambda (regularization parameter) = 0.1

Random Forest (RF):

  • Best mtry (number of variables randomly sampled as candidates at each split) = 12

XGBoost (XGB):

  • nrounds = 500 (number of boosting rounds)
  • max_depth = 9 (maximum tree depth)
  • eta = 0.1 (learning rate)
  • gamma = 0 (minimum loss reduction required to make a further partition)
  • colsample_bytree = 0.8 (subsample ratio of columns when constructing each tree)
  • min_child_weight = 1 (minimum sum of instance weight needed in a child)
  • subsample = 0.8 (subsample ratio of the training instances)

Insight:

These parameters represent the configurations that yielded the best performance for each respective model during the cross-validation tuning phase. These are the configurations used for the final model training and subsequent evaluation on the test set.

Overall Conclusion from Test Set Evaluation:

The test set evaluation confirms the hierarchy of model performance established during cross-validation:

  • XGBoost is the top-performing model with the lowest RMSE on unseen data.
  • Random Forest follows closely as the second-best model.
  • Lasso Regression and Linear Model show considerably weaker predictive accuracy on the test set.

This robust evaluation on a separate test set provides strong evidence that XGBoost, with its tuned hyperparameters, is the most suitable model among those tested for predicting Rent CPI in this context, offering the best generalization to new data.

6.6. Visualising Model Performance: Actual vs. Predicted Plots

While metrics like RMSE give me a good single number to compare models, it is also really useful to see how well the predictions line up with the actual values. For my best performing models (Random Forest and XGBoost, based on the test set RMSE), I decided to create a couple of diagnostic plots:

  1. Actual vs. Predicted Scatter Plot: This plot puts the actual rent_cpi values from the test set on one axis and the model’s predicted values on the other. If the model is making good predictions, the points should cluster tightly around a diagonal line where Actual = Predicted.
  2. Residuals Over Time Plot: Residuals are just the differences between the actual values and the predicted values (Actual - Predicted). Plotting these over time, possibly faceted by region, can help me see if the model has any systematic errors – for example, if it consistently over-predicts or under-predicts during certain periods or for specific regions. Ideally, the residuals should look like random noise centred around zero.

These visual checks can often reveal patterns or issues that a single error metric might not capture.

# This chunk assumes 'model_rf', 'model_xgb' and 'test_data' exist from previous chunks.

# Plotting for Random Forest, which was one of my top models.
if (exists("model_rf") && !is.null(model_rf) && 
    exists("test_data") && !is.null(test_data) && nrow(test_data) > 0) { 
      
  print("Plotting Random Forest: Actual vs Predicted, and Residuals, on Test Set...")
  # Making predictions with the Random Forest model on the test data.
  final_preds_rf <- predict(model_rf, newdata = test_data)
  
  # Creating a dataframe for easy plotting.
  plot_df_rf <- data.frame(
    Actual = test_data$rent_cpi,
    Predicted = as.numeric(final_preds_rf), # Ensuring predictions are numeric.
    Date = test_data$date,      # Including Date for the residual plot over time.
    Region = test_data$region    # Including Region for faceting the residual plot.
  )
  
  # --- Actual vs. Predicted Scatter Plot for Random Forest ---
  pred_vs_actual_plot_rf <- ggplot(plot_df_rf, aes(x = Actual, y = Predicted)) +
    geom_point(alpha = 0.6, colour = "forestgreen", size = 1.5) + 
    # Adding a y=x line for perfect agreement reference.
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "black", linewidth = 0.8) +
    labs(
      title = "Actual vs. Predicted Rent CPI (Random Forest - Test Set)",
      x = "Actual Rent CPI", 
      y = "Predicted Rent CPI"
    ) +
    coord_fixed() # Ensures a 1:1 aspect ratio, making it easier to judge predictions against the diagonal.
  print(pred_vs_actual_plot_rf)

  # --- Residuals Over Time Plot for Random Forest ---
  # Calculating residuals (Actual - Predicted).
  plot_df_residuals_rf <- plot_df_rf %>% mutate(Residuals = Actual - Predicted)
  
  residuals_time_plot_rf <- ggplot(plot_df_residuals_rf, aes(x = Date, y = Residuals)) +
    geom_line(alpha = 0.7, colour = "purple4") + 
    # Horizontal line at zero for reference (perfect prediction would have residuals of 0).
    geom_hline(yintercept = 0, linetype = "dashed", colour = "black") +
    # Faceting by region can show if the model struggles more with certain areas.
    # Adjusting ncol to be at most 3, or fewer if fewer unique regions in test_data.
    facet_wrap(~Region, scales = "free_y", ncol=min(3, length(unique(plot_df_residuals_rf$Region)))) + 
    labs(
      title = "Residuals Over Time (Random Forest - Test Set)",
      x = "Date", 
      y = "Residuals (Actual - Predicted)"
    ) +
    theme(strip.text = element_text(size=rel(0.8))) # Adjusting facet label size.
  print(residuals_time_plot_rf)
  
} else {
  print("Random Forest model ('model_rf') or test_data not available. Skipping its actual vs. predicted plots.")
}
## [1] "Plotting Random Forest: Actual vs Predicted, and Residuals, on Test Set..."

# Optional: Plotting for XGBoost as well, given it was also a strong performer.
if (exists("model_xgb") && !is.null(model_xgb) && 
    exists("test_data") && !is.null(test_data) && nrow(test_data) > 0) {
      
  print("Plotting XGBoost Actual vs Predicted on Test Set...")
  final_preds_xgb <- predict(model_xgb, newdata = test_data)
  
  plot_df_xgb <- data.frame(
    Actual = test_data$rent_cpi, 
    Predicted = as.numeric(final_preds_xgb)
    # Can add Date and Region here too if I want to plot XGBoost residuals like for RF.
    # Date = test_data$date,
    # Region = test_data$region 
  )
  
  pred_vs_actual_plot_xgb <- ggplot(plot_df_xgb, aes(x = Actual, y = Predicted)) +
    geom_point(alpha = 0.5, colour = "firebrick") +
    geom_abline(intercept = 0, slope = 1, linetype = "dashed", colour = "black", linewidth = 0.8) +
    labs(
      title = "Actual vs. Predicted Rent CPI (XGBoost Model - Test Set)", 
      x = "Actual Rent CPI", 
      y = "Predicted Rent CPI"
    ) +
    coord_fixed()
  print(pred_vs_actual_plot_xgb)
  
  # Example of how I could do residuals for XGBoost if I wanted:
  # plot_df_residuals_xgb <- plot_df_xgb %>% mutate(Residuals = Actual - Predicted, Date = test_data$date, Region = test_data$region)
  # residuals_time_plot_xgb <- ggplot(plot_df_residuals_xgb, aes(x = Date, y = Residuals)) +
  #   geom_line(alpha = 0.7, colour = "orangered3") + 
  #   geom_hline(yintercept = 0, linetype = "dashed", colour = "black") +
  #   facet_wrap(~Region, scales = "free_y", ncol=min(3, length(unique(plot_df_residuals_xgb$Region)))) + 
  #   labs(title = "Residuals Over Time (XGBoost - Test Set)", x = "Date", y = "Residuals") +
  #   theme(strip.text = element_text(size=rel(0.8)))
  # print(residuals_time_plot_xgb)

} else {
   print("XGBoost model ('model_xgb') or test_data not available. Skipping its actual vs. predicted plots.")
}
## [1] "Plotting XGBoost Actual vs Predicted on Test Set..."

Insights

Insights Summary

1. Actual vs. Predicted Rent CPI – Random Forest (Test Set)

The scatter plot shows that Random Forest predictions align well with actual Rent CPI values, forming a tight cluster around the ideal line. The model handles the full CPI range (1000–1550) and has minimal bias. Slight deviations are seen at higher CPI levels (over 1400), where prediction errors increase mildly.
Insight: Random Forest predicts Rent CPI with high accuracy overall, showing good alignment with actual values and only minor issues at higher price levels.

2. Residuals Over Time by Region – Random Forest (Test Set)

The residuals generally fluctuate around zero, indicating that the model doesn’t suffer from major systematic bias. However, there are differences across regions and time:
- Auckland: Residuals are fairly stable and mostly within ±25 CPI units.
- Canterbury: More volatile residuals, with under-prediction around 2012–2014 and over-prediction around 2017–2018.
- National: Slight under-prediction trend in later years (2016–2019).
- Rest of North Island: Some under-prediction spikes around 2019–2020.
- Rest of South Island: Noticeable volatility, including consistent under-prediction (2011–2012) and a strong spike in 2020.
- Wellington: Earlier predictions are accurate, but from 2014 onward there’s a visible shift — first over-prediction, then under-prediction.

Insight: While Random Forest performs well overall, residual behavior varies by region and time. Certain regions—especially Canterbury, Rest of South Island, and Wellington—show patterns of systematic over- or under-prediction during specific periods, indicating uneven regional accuracy.

3. Actual vs. Predicted Rent CPI – XGBoost (Test Set)

XGBoost predictions closely track actual CPI values, forming a tight and consistent pattern along the diagonal. The spread and accuracy are comparable to Random Forest but slightly better in RMSE.

Insight: XGBoost matches Random Forest in visual predictive performance and slightly exceeds it in precision. The model shows strong generalization and robustness with minimal bias across the full CPI range.

Overall Insight:

Both Random Forest and XGBoost deliver high predictive accuracy on unseen test data, with XGBoost marginally outperforming. However, regional and temporal residual analysis reveals that model performance isn’t uniform. Some regions and time periods show deviations, particularly for Random Forest. This suggests potential areas for future model refinement, especially for capturing regional dynamics more accurately.

7. Time Series Forecasting of Regional Rent CPI

After building and evaluating my regression models for predicting Rent CPI, I wanted to explore more traditional time series forecasting techniques as well. My goal here is to predict Rent CPI out to the end of 2030. I decided to try two main approaches for a selected region (I will use “Auckland” as the primary example in this report, but my R script is set up to loop through other regions too if I change the unique_regions_to_forecast variable):

  1. Direct ARIMA Modelling: Applying an ARIMA model directly to the historical Rent CPI series for a region.
  2. Recursive Forecasting with Machine Learning Models: Using my best regression model (Random Forest) in a recursive manner, which involves forecasting its predictors first (specifically, using ARIMA to forecast median_price).

First, I need to set up some general parameters for the forecasting horizon and ensure my previously trained models and recipes are ready if I need them.

# --- General Setup for My Forecasting Experiments ---

unique_regions_to_forecast <- unique(full_data$region)

# Defining how far out I want to forecast.
last_historical_date <- max(full_data$date, na.rm = TRUE)
forecast_horizon_start_date <- last_historical_date %m+% months(1) # Starting from the month after my last data point.
forecast_horizon_end_date <- make_date(2030, 12, 1) # My target end date.
future_dates_for_forecast <- seq(from = forecast_horizon_start_date, to = forecast_horizon_end_date, by = "month")
num_forecast_periods <- length(future_dates_for_forecast) # How many months this is.

print(paste("--- Initialising Forecasting for Region(s):", paste(unique_regions_to_forecast, collapse=", "), "---"))
## [1] "--- Initialising Forecasting for Region(s): Auckland, Wellington, Rest of North Island, Canterbury, Rest of South Island, National ---"
print(paste("Forecasting for", num_forecast_periods, "months, from", 
            format(forecast_horizon_start_date, "%Y-%m"), "to", format(forecast_horizon_end_date, "%Y-%m")))
## [1] "Forecasting for 126 months, from 2020-07 to 2030-12"
# Making sure my prepped recipe (from the ML section) is available, as I will need it for the Random Forest forecast.
if (!exists("prepped_recipe_for_models") || !"recipe" %in% class(prepped_recipe_for_models) || is.null(prepped_recipe_for_models$tr_info)) {
  # This check ensures that if I run this section independently, the recipe is prepped.
  # It assumes 'rec' and 'train_data' are loaded from the ML section.
  if(exists("rec") && exists("train_data")){
    print("Note: 'prepped_recipe_for_models' was not found globally, re-prepping from 'rec' and 'train_data'.")
    prepped_recipe_for_models <- prep(rec, training = train_data)
  } else {
    stop("The recipe ('rec') and 'train_data' are not available to prep for forecasting. Please ensure the ML section has run.")
  }
}

# Making sure my Random Forest model is loaded and ready.
if (!exists("model_rf") || !("train" %in% class(model_rf))) {
  stop("My Random Forest model ('model_rf') is not here. I need to run the ML training part first for RF forecasting.")
}
active_forecasting_model_rf <- model_rf # This is the RF model I plan to use for one of the methods.

# For the recursive RF forecast, I'll need national dwelling ratios.
# I am just going to hold these constant at their last known values for simplicity in these forecasts.
last_national_ratios_fc <- full_data %>%
  arrange(desc(date)) %>%
  # distinct() on these columns should give one row if they are truly national and consistent per date.
  distinct(rented_ratio_national, owner_ratio_national, free_ratio_national) %>%
  dplyr::slice(1) %>% # Taking the most recent set of distinct ratios.
  select(rented_ratio_national, owner_ratio_national, free_ratio_national)

if(nrow(last_national_ratios_fc) == 0 || any(is.na(last_national_ratios_fc))){
    print("Warning: Couldn't get complete last national ratios for forecasting. Using placeholder values for RF method.")
    last_national_ratios_fc <- data.frame(rented_ratio_national=0.3, owner_ratio_national=0.6, free_ratio_national=0.1)
}

# Initialising lists to store all forecasts from different methods and regions if I loop later.
all_regional_arima_cpi_forecasts <- list()
all_regional_rf_cpi_forecasts <- list()

7.1. Method 1: Direct ARIMA Forecasting for Regional Rent CPI

For this first forecasting method, I am using the auto.arima() function from the forecast package. This function is pretty handy as it automatically tries to select the best ARIMA(p,d,q)(P,D,Q)[m] model based on information criteria (like AICc). This approach just looks at the past values of the Rent CPI for a specific region to make its forecasts. It does not use any of the other predictors from my full_data table.

My process for each region (just “Auckland” in this R Markdown output for now, as defined in my unique_regions_to_forecast variable in the setup chunk) involves: 1. Extracting the historical Rent CPI for the chosen region. 2. Converting it to a time series (ts) object, which is what auto.arima needs. 3. Fitting the auto.arima model, allowing it to search for seasonality, drift, and apply a Box-Cox transformation if it helps. 4. Checking the residuals of the fitted model to make sure it is a decent fit (i.e., residuals should ideally look like white noise). 5. Generating the forecast for the desired number of periods (up to the end of 2030). 6. Plotting the forecast, including the 80% and 95% prediction intervals to show the uncertainty.

# This R chunk now contains its own loop for Method 1.
# It will iterate through each region in 'unique_regions_to_forecast'.

# Initialising a list to store ARIMA forecast objects if I run for multiple regions
# Note: all_regional_arima_cpi_forecasts was already initialised in the general setup chunk.
# if (!exists("all_regional_arima_cpi_forecasts")) all_regional_arima_cpi_forecasts <- list()

for (current_forecast_region in unique_regions_to_forecast) {
  
  print(paste0("\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: ", current_forecast_region, " ==="))
  
  regional_rent_cpi_history <- full_data %>%
    filter(region == current_forecast_region & !is.na(rent_cpi)) %>%
    select(date, rent_cpi) %>%
    arrange(date) 
  
  regional_arima_cpi_forecast_obj_loop <- NULL 

  if(nrow(regional_rent_cpi_history) < 36) { 
    print(paste("Not enough historical Rent CPI data for ARIMA for region:", current_forecast_region, "- Skipping."))
    all_regional_arima_cpi_forecasts[[current_forecast_region]] <- NULL 
    next # Move to the next region in the loop
  }
  
  start_year_arima_cpi <- lubridate::year(min(regional_rent_cpi_history$date))
  start_month_arima_cpi <- lubridate::month(min(regional_rent_cpi_history$date))
  ts_regional_cpi_direct <- ts(regional_rent_cpi_history$rent_cpi, 
                               start = c(start_year_arima_cpi, start_month_arima_cpi), 
                               frequency = 12) 
  
  plot_ts_object <- autoplot(ts_regional_cpi_direct) +
        ggtitle(paste("Monthly Rent CPI -", current_forecast_region)) +
        xlab("Year") + ylab("Rent CPI") +
        theme(plot.title = element_text(hjust = 0.5)) 
  print(plot_ts_object)
  
  print(paste("Fitting auto.arima for Rent CPI in", current_forecast_region, "..."))
  regional_arima_cpi_model_direct <- NULL 
  tryCatch({
    regional_arima_cpi_model_direct <- forecast::auto.arima(ts_regional_cpi_direct, 
                                         seasonal = TRUE, 
                                         stepwise = FALSE, approximation = FALSE, 
                                         allowdrift = TRUE, 
                                         lambda = "auto") 
    print(paste("ARIMA model fitted for Rent CPI in", current_forecast_region, ":", regional_arima_cpi_model_direct$method))
    print(summary(regional_arima_cpi_model_direct))
    
    print(paste("Residual diagnostics for Rent CPI ARIMA model of", current_forecast_region, ":"))
    forecast::checkresiduals(regional_arima_cpi_model_direct) 
    
    regional_arima_cpi_forecast_obj_loop <- forecast::forecast(regional_arima_cpi_model_direct, h = num_forecast_periods)
    
    all_regional_arima_cpi_forecasts[[current_forecast_region]] <- tibble(
        date = future_dates_for_forecast, 
        region = current_forecast_region,
        rent_cpi_forecast_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$mean),
        lower80_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$lower[,1]),
        upper80_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$upper[,1]),
        lower95_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$lower[,2]),
        upper95_arima = as.numeric(regional_arima_cpi_forecast_obj_loop$upper[,2])
    )
    
    plot_arima_direct_fc <- autoplot(regional_arima_cpi_forecast_obj_loop) +
      ggtitle(paste0(num_forecast_periods/12, "-Year Rent CPI Forecast for ", current_forecast_region, " (ARIMA)"),
              subtitle = paste("Model:", regional_arima_cpi_model_direct$method)) +
      xlab("Year") + ylab("Rent CPI Forecast") +
      guides(fill=guide_legend(title="Prediction Interval:")) +
      theme(plot.title = element_text(hjust = 0.5), plot.subtitle = element_text(hjust = 0.5)) 
    print(plot_arima_direct_fc)
    
    print(paste("ARIMA Forecasted values (first 6 months) for Rent CPI in", current_forecast_region, ":"))
    if(!is.null(regional_arima_cpi_forecast_obj_loop)) print(head(regional_arima_cpi_forecast_obj_loop$mean))
    
  }, error = function(e_arima_cpi) {
    print(paste("Direct ARIMA modelling for Rent CPI failed for", current_forecast_region, ":", e_arima_cpi$message))
    all_regional_arima_cpi_forecasts[[current_forecast_region]] <- NULL 
  })
} # End of the 'for' loop for current_forecast_region (for Method 1)
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Auckland ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "Fitting auto.arima for Rent CPI in Auckland ..."
## [1] "ARIMA model fitted for Rent CPI in Auckland : "
## Series: ts_regional_cpi_direct 
## ARIMA(2,1,0)(1,0,0)[12] with drift 
## Box Cox transformation: lambda= 0.789686 
## 
## Coefficients:
##           ar1      ar2    sar1   drift
##       -0.3077  -0.2904  0.2328  0.6876
## s.e.   0.0773   0.0774  0.0862  0.1497
## 
## sigma^2 = 5.766:  log likelihood = -367.91
## AIC=745.81   AICc=746.2   BIC=761.22
## 
## Training set error measures:
##                       ME     RMSE      MAE          MPE      MAPE      MASE       ACF1
## Training set -0.02772669 10.68066 8.632323 -0.001775353 0.6794153 0.2126537 -0.0244276
## [1] "Residual diagnostics for Rent CPI ARIMA model of Auckland :"
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0)(1,0,0)[12] with drift
## Q* = 22.651, df = 21, p-value = 0.3629
## 
## Model df: 3.   Total lags used: 24
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Auckland :"
##           Jul      Aug      Sep      Oct      Nov      Dec
## 2020 1513.793 1518.572 1515.413 1519.628 1530.212 1532.353
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Wellington ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "Fitting auto.arima for Rent CPI in Wellington ..."
## [1] "ARIMA model fitted for Rent CPI in Wellington : "
## Series: ts_regional_cpi_direct 
## ARIMA(0,1,1)(0,1,2)[12] 
## Box Cox transformation: lambda= -0.8999268 
## 
## Coefficients:
##           ma1     sma1     sma2
##       -0.4992  -0.5755  -0.1749
## s.e.   0.0692   0.1020   0.0987
## 
## sigma^2 = 1.1e-07:  log likelihood = 1385.71
## AIC=-2763.41   AICc=-2763.14   BIC=-2751.4
## 
## Training set error measures:
##                    ME     RMSE      MAE        MPE     MAPE      MASE      ACF1
## Training set 0.773162 46.95124 20.49915 0.08885825 1.740393 0.4659603 0.1495763
## [1] "Residual diagnostics for Rent CPI ARIMA model of Wellington :"
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(0,1,2)[12]
## Q* = 11.565, df = 21, p-value = 0.9506
## 
## Model df: 3.   Total lags used: 24
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Wellington :"
##           Jul      Aug      Sep      Oct      Nov      Dec
## 2020 1553.909 1566.575 1569.266 1580.762 1609.558 1594.449
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Rest of North Island ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "Fitting auto.arima for Rent CPI in Rest of North Island ..."
## [1] "ARIMA model fitted for Rent CPI in Rest of North Island : "
## Series: ts_regional_cpi_direct 
## ARIMA(2,1,0) with drift 
## Box Cox transformation: lambda= -0.8999268 
## 
## Coefficients:
##           ar1      ar2  drift
##       -0.4405  -0.1913  0e+00
## s.e.   0.0872   0.0814  1e-04
## 
## sigma^2 = 7.95e-09:  log likelihood = 1585.65
## AIC=-3163.31   AICc=-3163.05   BIC=-3150.98
## 
## Training set error measures:
##                    ME     RMSE      MAE       MPE      MAPE     MASE       ACF1
## Training set 2.716329 30.16879 9.516836 0.2392734 0.8159561 0.223016 0.02392561
## [1] "Residual diagnostics for Rent CPI ARIMA model of Rest of North Island :"
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(2,1,0) with drift
## Q* = 0.81089, df = 22, p-value = 1
## 
## Model df: 2.   Total lags used: 24
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Rest of North Island :"
##           Jul      Aug      Sep      Oct      Nov      Dec
## 2020 1618.724 1623.700 1630.576 1636.436 1642.448 1648.660
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Canterbury ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "Fitting auto.arima for Rent CPI in Canterbury ..."
## [1] "ARIMA model fitted for Rent CPI in Canterbury : "
## Series: ts_regional_cpi_direct 
## ARIMA(1,1,0)(1,0,0)[12] with drift 
## Box Cox transformation: lambda= 0.9358105 
## 
## Coefficients:
##           ar1    sar1  drift
##       -0.4158  0.3046  1.430
## s.e.   0.0733  0.0812  0.885
## 
## sigma^2 = 132.2:  log likelihood = -620.81
## AIC=1249.61   AICc=1249.87   BIC=1261.94
## 
## Training set error measures:
##                       ME     RMSE      MAE          MPE     MAPE      MASE       ACF1
## Training set -0.07469888 17.99377 13.75772 -0.007292928 1.086328 0.2930917 -0.0247203
## [1] "Residual diagnostics for Rent CPI ARIMA model of Canterbury :"
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,0)(1,0,0)[12] with drift
## Q* = 30.209, df = 22, p-value = 0.1135
## 
## Model df: 2.   Total lags used: 24
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Canterbury :"
##           Jul      Aug      Sep      Oct      Nov      Dec
## 2020 1388.796 1391.117 1401.172 1386.604 1407.331 1409.722
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: Rest of South Island ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "Fitting auto.arima for Rent CPI in Rest of South Island ..."
## [1] "ARIMA model fitted for Rent CPI in Rest of South Island : "
## Series: ts_regional_cpi_direct 
## ARIMA(0,1,1)(1,0,0)[12] 
## Box Cox transformation: lambda= -0.6768916 
## 
## Coefficients:
##           ma1    sar1
##       -0.6015  0.6269
## s.e.   0.0719  0.0604
## 
## sigma^2 = 4.287e-08:  log likelihood = 1165.43
## AIC=-2324.86   AICc=-2324.71   BIC=-2315.62
## 
## Training set error measures:
##                    ME    RMSE      MAE       MPE     MAPE      MASE       ACF1
## Training set 2.236949 29.5682 20.21368 0.1786773 1.658088 0.5005038 0.01989861
## [1] "Residual diagnostics for Rent CPI ARIMA model of Rest of South Island :"
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(1,0,0)[12]
## Q* = 13.782, df = 22, p-value = 0.909
## 
## Model df: 2.   Total lags used: 24
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in Rest of South Island :"
##           Jul      Aug      Sep      Oct      Nov      Dec
## 2020 1415.177 1418.078 1431.342 1412.851 1424.438 1409.355
## [1] "\n === METHOD 1: Processing Direct ARIMA for Rent CPI in Region: National ==="
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "Fitting auto.arima for Rent CPI in National ..."
## [1] "ARIMA model fitted for Rent CPI in National : "
## Series: ts_regional_cpi_direct 
## ARIMA(0,1,1)(1,0,0)[12] with drift 
## Box Cox transformation: lambda= -0.101685 
## 
## Coefficients:
##           ma1    sar1   drift
##       -0.2623  0.5669  0.0011
## s.e.   0.0761  0.0674  0.0004
## 
## sigma^2 = 8.944e-06:  log likelihood = 707.9
## AIC=-1407.79   AICc=-1407.54   BIC=-1395.47
## 
## Training set error measures:
##                      ME     RMSE     MAE          MPE      MAPE      MASE         ACF1
## Training set -0.1265951 7.771131 5.86455 -0.008092995 0.4700289 0.1529883 -0.006882079
## [1] "Residual diagnostics for Rent CPI ARIMA model of National :"
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(0,1,1)(1,0,0)[12] with drift
## Q* = 19.179, df = 22, p-value = 0.6342
## 
## Model df: 2.   Total lags used: 24
Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

Plot 7.1: Direct ARIMA Forecasts for Rent CPI (for selected regions). Caption will update per plot if multiple regions run.

## [1] "ARIMA Forecasted values (first 6 months) for Rent CPI in National :"
##           Jul      Aug      Sep      Oct      Nov      Dec
## 2020 1502.499 1510.330 1514.173 1510.031 1532.657 1528.003

Insights


ARIMA Model & Residual Diagnostics

  • Auckland (ARIMA(2,1,0)(1,0,0)[12]): Residuals appear random; minor ACF spike near lag 9/10; bell-shaped histogram → good fit.
  • Canterbury (ARIMA(1,1,0)(1,0,0)[12]): Residuals show some autocorrelation (notably at lag 12); suggests some seasonal effects not captured.
  • National (ARIMA(0,1,1)(1,0,0)[12]): Residuals mostly random; borderline spike at lag 12 → generally good fit.
  • Rest of North Island (ARIMA(2,1,0)): Residuals resemble white noise; ACF flat → excellent statistical fit.
  • Rest of South Island (ARIMA(0,1,1)(1,0,0)[12]): Significant ACF spikes at lags 12 & 24 → seasonality not fully captured.
  • Wellington (ARIMA(0,1,1)(0,1,2)[12]): Residuals highly random and near-zero; ACF clean → strong fit.

Forecast Plot & Regional Insights (to 2030)

  • Auckland: Forecast rises to ~1900–2000; uncertainty increases over time.
    Insight: Captures strong trend; future values are uncertain but growth is expected.

  • Canterbury: Modest rise to ~1600–1700; wide prediction intervals.
    Insight: Forecast suggests slow recovery; seasonal effects may be under-modeled.

  • National: Consistent growth forecast toward ~2000; expanding uncertainty.
    Insight: Forecast aligns with historical trend; generally reliable.

  • Rest of North Island: Forecast surges beyond 2800; aggressive trend continuation.
    Insight: Very steep forecast; reflects trend but lacks seasonality—use caution.

  • Rest of South Island: Forecast remains flat at ~1300–1350; wide intervals.
    Insight: Recent drop heavily influences trend; forecast highly uncertain and potentially unreliable.

  • Wellington: Forecast exceeds 2500; captures cyclical trend with good fit.
    Insight: Strong growth projection with seasonality modeled; solid statistical foundation.


Overall Comparative Summary (Method 1: Direct ARIMA)

  • Trend Dominance: Regions like Auckland, National, Rest of North Island, and Wellington show strong historical and projected Rent CPI growth.
  • Model Fit: Statistical fits are strong for most (esp. Auckland, Wellington, Rest of North Island), but weaker for Canterbury and Rest of South Island due to residual seasonality.
  • Forecast Divergence:
    • Strong growth: Auckland, National, Rest of North Island, Wellington.
    • Modest growth/recovery: Canterbury.
    • Flat/highly uncertain: Rest of South Island (due to volatility and sharp recent drop).

7.2. Method 2: Recursive Random Forest Forecast for Regional Rent CPI

My second approach to forecasting Rent CPI uses the Random Forest model (model_rf) that performed well in the machine learning section. This method is a bit more complex than the direct ARIMA on Rent CPI because my Random Forest model relies on several other predictor variables. To use it for forecasting, I need to have future values for all these predictors.

The main steps for this method, for each region I am analysing (just “Auckland” for this R Markdown output, as defined in unique_regions_to_forecast), are:

  1. Forecast Key Exogenous Predictors: The most important predictor that itself changes significantly over time is median_price. So, for the current region, I will first create an ARIMA forecast for its median_price out to 2030. I decided to use a log transformation (lambda = 0) for this median_price forecast, as that often helps produce more stable and sensible long-term trends for price data. Other predictors, like the national dwelling ratios, I will hold constant at their last known values for this forecasting exercise as a simplification.

  2. Recursive Prediction for Rent CPI: Once I have the future path for median_price (and assumptions for other external predictors), I will predict rent_cpi one month at a time using my trained Random Forest model:

    • For the first future month, I will use the last known actual values from my historical data to help create any lagged features the model needs.
    • I will then predict rent_cpi for that month.
    • This predicted rent_cpi will then be used to help create the lagged features for the next month’s prediction, and so on. This step-by-step process is why it is called a “recursive” forecast.
  3. Approximation for price_rent_cpi_ratio: One of the features my Random Forest model was trained on is price_rent_cpi_ratio = median_price / rent_cpi. When I am forecasting for a future month, I do not yet know the rent_cpi for that exact month to calculate this ratio. So, for the purpose of creating inputs for the forecast, I will approximate this feature by using the previous month’s rent_cpi (which will be either the last known actual value or the value I just forecasted in the previous step) in the denominator: price_rent_cpi_ratio_approx = median_price_future / rent_cpi_previous_month. This is a workaround for recursive forecasting when such ratios are used in the model. It is an approximation, and for an even more robust approach in future work, I might consider retraining my Random Forest model with a feature that is already based on lagged CPI, like median_price / lag(rent_cpi, 1).

Now, here is the R code that implements this recursive Random Forest forecasting method.

# This R chunk contains its own loop for Method 2 (Recursive RF).
# It will iterate through each region in 'unique_regions_to_forecast'.

for (current_forecast_region in unique_regions_to_forecast) {
  
  print(paste0("\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in ", current_forecast_region, " ==="))

  # --- Step 2a: Forecast 'median_price' for current_forecast_region using ARIMA ---
  # This forecasted median_price will be an input to my RF model for Rent CPI.
  print(paste("--- Sub-step for Method 2: Forecasting median_price for", current_forecast_region, "using ARIMA ---"))
  
  historical_median_price_data_rf <- full_data %>%
    filter(region == current_forecast_region & !is.na(median_price)) %>%
    select(date, median_price) %>% 
    arrange(date)

  future_median_prices_for_rf_loop <- NULL # Initialise this for each region within the loop

  if(nrow(historical_median_price_data_rf) < 24) { # Need at least a couple of years for a decent ARIMA
    print(paste("Warning: Not enough historical median_price data for", current_forecast_region, 
                "to fit ARIMA. Holding last known price constant for RF input."))
    if(nrow(historical_median_price_data_rf) > 0) {
      future_median_prices_for_rf_loop <- rep(tail(historical_median_price_data_rf$median_price, 1), num_forecast_periods)
    } else {
      # If absolutely no historical median price, RF forecasting for this region will be problematic.
      future_median_prices_for_rf_loop <- rep(NA, num_forecast_periods) 
      print(paste("CRITICAL WARNING: No median price data at all for", current_forecast_region,
                  ". RF forecast's median_price input will be NA, which will likely cause issues."))
    }
  } else {
    median_price_ts_rf <- ts(historical_median_price_data_rf$median_price, 
                             start = c(year(min(historical_median_price_data_rf$date)), 
                                       month(min(historical_median_price_data_rf$date))), 
                             frequency = 12)
    
    arima_median_price_model_rf_loop <- NULL # Initialise model object for this region
    tryCatch({
      # Using lambda=0 (log transform) for median_price as discussed, 
      # to try and get a more sensible long-term trend for this volatile series.
      arima_median_price_model_rf_loop <- forecast::auto.arima(median_price_ts_rf, 
                                               seasonal = TRUE, stepwise = TRUE, approximation = FALSE, 
                                               allowdrift = TRUE, lambda = 0) 
      future_median_prices_forecast_obj_rf_loop <- forecast::forecast(arima_median_price_model_rf_loop, h = num_forecast_periods)
      future_median_prices_for_rf_loop <- as.numeric(future_median_prices_forecast_obj_rf_loop$mean)
      
      print(paste("ARIMA for median_price (input for RF) in", current_forecast_region, "fitted:", arima_median_price_model_rf_loop$method))
      
      # I can plot this helper forecast to see what median_price projection is being used.
      plot_median_price_fc_rf <- autoplot(future_median_prices_forecast_obj_rf_loop) + 
      ggtitle(paste("Helper: Median Price ARIMA Forecast for", current_forecast_region,"(for RF)"),
      subtitle = paste("Model:", arima_median_price_model_rf_loop$method)) +
      scale_y_continuous(labels = dollar_format(prefix="$", scale=1/1000, suffix="K")) +
      xlab("Year") + ylab("Median Price Forecast")
      print(plot_median_price_fc_rf)
              
    }, error = function(e_arima_mp) {
      print(paste("ARIMA for median_price input (for RF) failed for", current_forecast_region, ":", e_arima_mp$message))
      # Fallback if ARIMA for median price fails.
      if(nrow(historical_median_price_data_rf) > 0) {
          # Using <<- to assign to variable in parent environment of the tryCatch error function.
          future_median_prices_for_rf_loop <<- rep(tail(historical_median_price_data_rf$median_price, 1), num_forecast_periods)
      } else {
          future_median_prices_for_rf_loop <<- rep(NA, num_forecast_periods)
      }
    })
  }
  
  # Proceed with RF forecast for this region only if I have future median prices.
  if(is.null(future_median_prices_for_rf_loop) || all(is.na(future_median_prices_for_rf_loop))) {
      print(paste("Cannot proceed with RF forecast for", current_forecast_region, "due to missing/failed future median price forecast."))
      all_regional_rf_cpi_forecasts[[current_forecast_region]] <- NULL # Store NULL if skipped
  } else {
      # --- 2b. Prepare Last Known Actual Data for current_forecast_region for RF ---
      # This is needed to bootstrap the recursive process (lags for the first forecast step).
      last_observation_rf_startup <- full_data %>%
        filter(region == current_forecast_region) %>% 
        filter(date == max(date, na.rm = TRUE)) %>%
        # Selecting columns that match ml_data_final structure, plus original lags from full_data.
        select(any_of(c(colnames(ml_data_final), "rent_cpi_lag", "median_price_lag", "price_growth_monthly", "rent_growth_monthly"))) %>% 
        # Re-ensure date components are present as they were in ml_data_final (used by recipe).
        mutate(
          year = lubridate::year(date), 
          month = lubridate::month(date), 
          quarter = lubridate::quarter(date), 
          half_year = ifelse(month %in% 1:6, 1, 2)
        )

      if(nrow(last_observation_rf_startup) == 0) {
        print(paste("No final historical data row found for RF for", current_forecast_region, ". Skipping RF forecast for this region."))
        all_regional_rf_cpi_forecasts[[current_forecast_region]] <- NULL
      } else {
        # Initialise values for the loop based on the true last observation (time t)
        rent_cpi_current_is_previous_for_next <- last_observation_rf_startup$rent_cpi 
        # and the observation before that (time t-1) for growth calculations
        rent_cpi_previous_is_two_ago_for_next <- last_observation_rf_startup$rent_cpi_lag 
        median_price_current_is_previous_for_next <- last_observation_rf_startup$median_price

        rent_cpi_forecasts_rf_list_region <- list() # To store this region's RF forecasts

        for (i in 1:num_forecast_periods) {
          current_future_date_rf <- future_dates_for_forecast[i] # This is the date for t+i
          
          # --- Constructing features for this future month (t+i) ---
          year_fut_rf = lubridate::year(current_future_date_rf)
          month_fut_rf = lubridate::month(current_future_date_rf) # Will be factorised by recipe
          quarter_fut_rf = lubridate::quarter(current_future_date_rf) # Will be factorised by recipe
          half_year_fut_rf = ifelse(month_fut_rf %in% 1:6, 1, 2) # Will be factorised by recipe
          
          median_price_fut_rf = future_median_prices_for_rf_loop[i] # This is the ARIMA-forecasted median_price for t+i
          log_median_price_fut_rf = ifelse(!is.na(median_price_fut_rf) & median_price_fut_rf > 0, log(median_price_fut_rf), NA)
          
          # Price growth for t+i: (price(t+i) - price(t+i-1)) / price(t+i-1)
          # price(t+i-1) is median_price_current_is_previous_for_next
          price_growth_monthly_fut_rf <- ifelse(!is.na(median_price_current_is_previous_for_next) & median_price_current_is_previous_for_next != 0 & !is.na(median_price_fut_rf), 
                                                (median_price_fut_rf - median_price_current_is_previous_for_next) / median_price_current_is_previous_for_next, NA)
          # Fallback for the very first forecast step if the immediate lag needed for growth was NA from history
          if (i == 1 && is.na(price_growth_monthly_fut_rf) && !is.na(last_observation_rf_startup$price_growth_monthly)) {
              price_growth_monthly_fut_rf <- last_observation_rf_startup$price_growth_monthly 
          }
                                                
          # National dwelling ratios (held constant as per last_national_ratios_fc)
          rented_ratio_national_fut_rf <- last_national_ratios_fc$rented_ratio_national
          owner_ratio_national_fut_rf <- last_national_ratios_fc$owner_ratio_national
          free_ratio_national_fut_rf <- last_national_ratios_fc$free_ratio_national
          
          # Rent growth for t+i: (rent_cpi(t+i-1) - rent_cpi(t+i-2)) / rent_cpi(t+i-2)
          # rent_cpi(t+i-1) is rent_cpi_current_is_previous_for_next
          # rent_cpi(t+i-2) is rent_cpi_previous_is_two_ago_for_next
          rent_growth_monthly_fut_rf <- ifelse(!is.na(rent_cpi_current_is_previous_for_next) & !is.na(rent_cpi_previous_is_two_ago_for_next) & rent_cpi_previous_is_two_ago_for_next != 0, 
                                           (rent_cpi_current_is_previous_for_next - rent_cpi_previous_is_two_ago_for_next) / rent_cpi_previous_is_two_ago_for_next, NA)
          # Fallback for the very first forecast step
          if (i == 1 && is.na(rent_growth_monthly_fut_rf) && !is.na(last_observation_rf_startup$rent_growth_monthly) ) {
              rent_growth_monthly_fut_rf <- last_observation_rf_startup$rent_growth_monthly 
          }

          # Approximated price_rent_cpi_ratio using t-1 rent_cpi (rent_cpi_current_is_previous_for_next)
          # My model was trained with contemporaneous rent_cpi in this ratio. This is an approximation for forecasting.
          price_rent_cpi_ratio_fut_approx_rf <- ifelse(rent_cpi_current_is_previous_for_next > 0 & !is.na(median_price_fut_rf), 
                                                    median_price_fut_rf / rent_cpi_current_is_previous_for_next, NA)

          # Assembling the feature row. Must match the structure the recipe expects.
          future_row_for_rf_predict <- tibble(
            date = current_future_date_rf, # For recipe's ID role (original date)
            region = current_forecast_region, # For recipe's ID role
            # Core predictors that were in ml_feature_cols
            median_price = median_price_fut_rf,
            rented_ratio_national = rented_ratio_national_fut_rf,
            owner_ratio_national = owner_ratio_national_fut_rf,
            free_ratio_national = free_ratio_national_fut_rf,
            price_rent_cpi_ratio = price_rent_cpi_ratio_fut_approx_rf, 
            log_median_price = log_median_price_fut_rf,
            price_growth_monthly = price_growth_monthly_fut_rf,
            rent_growth_monthly = rent_growth_monthly_fut_rf,
            # Date components that the recipe will factorise and dummy
            year = year_fut_rf,
            month = month_fut_rf, 
            quarter = quarter_fut_rf,
            half_year = half_year_fut_rf
            # rent_cpi (outcome) is not included here for prediction
          )
          
          # Predicting rent_cpi using my trained RF model.
          # Caret's predict() on a 'train' object automatically applies the prepped recipe.
          predicted_rent_cpi_value_rf <- predict(active_forecasting_model_rf, newdata = future_row_for_rf_predict)
          
          # Storing the prediction for this month
          current_forecast_row_rf <- tibble(date = current_future_date_rf, 
                                          region = current_forecast_region, 
                                          rent_cpi_forecast = as.numeric(predicted_rent_cpi_value_rf))
          rent_cpi_forecasts_rf_list_region[[i]] <- current_forecast_row_rf
          
          # Updating values for the next iteration's lags
          rent_cpi_previous_is_two_ago_for_next <- rent_cpi_current_is_previous_for_next       
          rent_cpi_current_is_previous_for_next <- as.numeric(predicted_rent_cpi_value_rf) 
          median_price_current_is_previous_for_next <- median_price_fut_rf 
        } # End of inner forecasting loop (for i in 1:num_forecast_periods) for RF for one region

        # Storing the complete RF forecast for the current_forecast_region in my main list
        all_regional_rf_cpi_forecasts[[current_forecast_region]] <- bind_rows(rent_cpi_forecasts_rf_list_region)
        print(paste("Recursive RF forecasting complete for", current_forecast_region))
        
        # Plotting the RF forecast for the current region
        historical_rent_cpi_for_plot_rf <- ml_data_final %>% filter(region == current_forecast_region) %>% select(date, rent_cpi)
        current_rf_plot_data <- bind_rows(
          historical_rent_cpi_for_plot_rf %>% rename(value = rent_cpi) %>% mutate(type = "Actual"),
          all_regional_rf_cpi_forecasts[[current_forecast_region]] %>% rename(value = rent_cpi_forecast) %>% mutate(type = "Forecast (RF Recursive)")
        )
        if(nrow(current_rf_plot_data) > 0) {
          rf_plot <- ggplot(current_rf_plot_data, aes(x = date, y = value, color = type, group = type)) + 
            geom_line(linewidth = 1) +
            scale_colour_manual(values = c("Actual" = "navyblue", "Forecast (RF Recursive)" = "darkgoldenrod1"), name = "Data Type:") +
            scale_y_continuous(labels = comma) +
            labs(title = paste("Rent CPI RF Forecast:", current_forecast_region), 
                 subtitle = "Recursive using ARIMA for Median Price input", y = "Rent CPI") +
            theme(legend.position = "top") 
          print(rf_plot)
        }
      } # End else for last_observation_rf_startup check
  } # End else for future_median_prices_for_rf_loop check
  
  
} # --- END OF THE 'for (current_forecast_region in unique_regions_to_forecast)' LOOP ---
## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Auckland ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Auckland using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Auckland fitted: "

## [1] "Recursive RF forecasting complete for Auckland"

## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Wellington ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Wellington using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Wellington fitted: "

## [1] "Recursive RF forecasting complete for Wellington"

## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Rest of North Island ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Rest of North Island using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Rest of North Island fitted: "

## [1] "Recursive RF forecasting complete for Rest of North Island"

## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Canterbury ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Canterbury using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Canterbury fitted: "

## [1] "Recursive RF forecasting complete for Canterbury"

## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in Rest of South Island ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for Rest of South Island using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in Rest of South Island fitted: "

## [1] "Recursive RF forecasting complete for Rest of South Island"

## [1] "\n === METHOD 2: Processing Recursive RF Forecast for Rent CPI in National ==="
## [1] "--- Sub-step for Method 2: Forecasting median_price for National using ARIMA ---"
## [1] "ARIMA for median_price (input for RF) in National fitted: "

## [1] "Recursive RF forecasting complete for National"

# Now that the loop has finished for all specified regions, I can proceed to the combined plot.

print("--- All Regional Forecasting Attempts (Method 1 & 2) Completed ---")
## [1] "--- All Regional Forecasting Attempts (Method 1 & 2) Completed ---"

Insights

Helper Median Price ARIMA Forecasts (Inputs for RF):

1. Auckland:
The ARIMA model (console previously indicated ARIMA(0,1,1)(0,1,1)[12] with drift) projects Auckland’s median price to increase from its early 2020 level (~$900K-$1M) at a subdued rate, reaching roughly $1.1M by 2030.
Uncertainty: Prediction intervals are very wide.

2. Canterbury:
The ARIMA model (plot subtitle indicates ARIMA(0,1,0)(0,1,1)[12] with drift) projects Canterbury’s median price to continue its upward trend from ~$450K-$500K in early 2020, reaching around $650K-$700K by 2030.
Uncertainty: Prediction intervals are wide and expand.

3. National:
The ARIMA model (plot subtitle indicates ARIMA(0,1,2)(0,1,1)[12] with drift) projects the national median price to increase from ~$600K-$700K in early 2020 to around $850K-$900K by 2030, showing a continued but modest upward trend.
Uncertainty: Prediction intervals are substantial and widen.

4. Rest of North Island:
The ARIMA model (plot subtitle indicates ARIMA(0,1,0)(0,1,1)[12] with drift) projects a strong and continuous upward trend for median prices, rising from ~$500K-$600K in early 2020 to potentially over $1.5M by 2030.
Uncertainty: Prediction intervals are extremely wide.

5. Rest of South Island:
The ARIMA model (plot subtitle indicates ARIMA(0,1,1)(0,1,1)[12] with drift) projects median prices to continue increasing from ~$450K in early 2020 to around $700K-$750K by 2030.
Uncertainty: Prediction intervals are wide.

6. Wellington:
The ARIMA model (plot subtitle indicates ARIMA(1,1,0)(0,1,1)[12] with drift) projects a significant and continued upward trend for Wellington’s median price, from ~$700K-$800K in early 2020 to potentially over $1.6M by 2030.
Uncertainty: Prediction intervals are very wide.


Rent CPI RF Forecasts (by Region):

1. Auckland:
Rent CPI RF Forecast: The forecast (orange line) shows an initial slight dip/flattening from early 2020 levels (~1500-1550 CPI), then maintains a very gentle upward trend with minor fluctuations, remaining largely stable around 1500-1550 through to 2030. This suggests a significant slowdown compared to its strong historical trend and contrasts with the more aggressive direct ARIMA forecast for Rent CPI (Method 1).
Insight (Auckland): The RF model, influenced by a decelerating median price forecast, projects a stabilization of Rent CPI.

2. Canterbury:
Rent CPI RF Forecast: The forecast shows considerable volatility but generally an upward trend, rising from ~1400-1450 in early 2020 to around 1500-1550 by 2030. The forecast path is much less smooth than for other regions. This suggests a continued, albeit somewhat erratic, increase in Rent CPI.
Insight (Canterbury): The RF forecast suggests continued, but volatile, Rent CPI growth, influenced by a moderately increasing median price projection.

3. National:
Rent CPI RF Forecast: The forecast shows the national Rent CPI remaining very stable, hovering around the 1500-1520 level throughout the forecast period to 2030, with minor fluctuations. This indicates a significant flattening compared to the strong historical upward trend.
Insight (National): The RF model projects a stabilization of the national Rent CPI, likely reflecting the more moderate growth projected for national median prices.

4. Rest of North Island:
Rent CPI RF Forecast: The forecast shows an initial dip, then a period of relative stability around the 1500-1550 CPI level, with some fluctuations but no strong upward trend through to 2030. This is a surprising outcome given the very strong upward projection for its median price input.
Insight (Rest of North Island): Despite a very strong upward median price forecast, the RF model projects a relatively flat Rent CPI. This divergence suggests complex model interactions or limitations.

5. Rest of South Island:
Rent CPI RF Forecast: The forecast shows an initial period of volatility, then stabilizes around the 1450-1500 CPI level, with some fluctuations but no clear strong upward trend through to 2030. This indicates a stabilization after recent historical volatility.
Insight (Rest of South Island): The RF model projects a stabilization of Rent CPI, influenced by a moderately increasing median price forecast.

6. Wellington:
Rent CPI RF Forecast: The forecast shows Rent CPI remaining relatively stable around the 1500-1550 mark, with some fluctuations, after an initial slight dip from its early 2020 peak. Similar to “Rest of North Island,” this relatively flat Rent CPI forecast is counterintuitive given the strong upward projection for its median price input.
Insight (Wellington): Despite a strong upward median price forecast, the RF model projects a relatively stable Rent CPI, suggesting the model is not solely driven by the price input in this scenario.


Overall Comparative Summary & Insights (Method 2: Recursive RF)

General Trend - Stabilization/Modest Growth:
Unlike the direct ARIMA forecasts for Rent CPI (Method 1) which often showed strong continued upward trends, the recursive Random Forest forecasts (Method 2) generally project a period of stabilization or very modest growth for Rent CPI across most regions (Auckland, National, Rest of North Island, Rest of South Island, Wellington). Canterbury is somewhat an exception, showing more volatile but still upwardly trended growth.

Influence of Median Price Forecasts:
The Rent CPI forecasts from the RF model are inherently linked to the “helper” ARIMA forecasts for median property prices. Regions where median prices were forecasted to grow more slowly or stabilize (e.g., Auckland, National) also saw flatter RF Rent CPI forecasts. However, for regions with very aggressive median price forecasts (e.g., Rest of North Island, Wellington), the RF Rent CPI forecast did not always follow suit with similar aggression, often showing stabilization instead. This might indicate:

  • The RF model has learned complex non-linear relationships where price increases don’t always translate proportionally to rent increases, especially if other factors (held constant or derived) don’t support it.
  • The RF model might be hesitant to extrapolate Rent CPI values far beyond the range seen in its training data, even if one input predictor (median price) is forecasted to move strongly.
  • The approximation of price_rent_cpi_ratio using lagged Rent CPI might dampen the responsiveness.

Uncertainty in Inputs:
The significant uncertainty (wide prediction intervals) in the helper ARIMA forecasts for median prices is a major underlying source of uncertainty for these RF Rent CPI forecasts, even though explicit prediction intervals are not generated by this recursive RF method.

Volatility in Forecasts:
Some regions, like Canterbury and Rest of South Island, show more inherent volatility in their RF Rent CPI forecasts, likely reflecting the historical volatility in their data that the RF model attempts to capture or is influenced by.

7.3. Comparison of Regional Forecasts

Now that I have generated Rent CPI forecasts using two different methods (direct ARIMA for Rent CPI, and a recursive Random Forest model using ARIMA-forecasted median prices), I think it is very useful to plot these different forecasts together on the same chart, along with the actual historical data. This will allow me to directly compare their trajectories, levels, and how they perform relative to each other for the selected region(s).

If I run the forecasting loop for multiple regions, this combined plot will be faceted, making it easy to see these comparisons across different parts of New Zealand.

# This chunk assumes 'all_regional_arima_cpi_forecasts' and 'all_regional_rf_cpi_forecasts'
# lists have been populated by the preceding regional_forecasting_loop_chunk.
# It also uses 'full_data' for historical actuals and 'unique_regions_to_forecast' for filtering.

print("--- Generating Combined Forecast Comparison Plot ---")
## [1] "--- Generating Combined Forecast Comparison Plot ---"
# This plot will only be generated if forecasts were successfully created for at least one region by at least one method.
# The condition also checks if 'unique_regions_to_forecast' is defined.
if (exists("unique_regions_to_forecast") && length(unique_regions_to_forecast) > 0 && 
    ((exists("all_regional_arima_cpi_forecasts") && length(all_regional_arima_cpi_forecasts) > 0) || 
     (exists("all_regional_rf_cpi_forecasts") && length(all_regional_rf_cpi_forecasts) > 0)) ) {
      
  # Binding all the ARIMA forecasts together (if any were successful)
  final_arima_forecasts_all_regions <- bind_rows(all_regional_arima_cpi_forecasts)
  # Binding all the Random Forest forecasts together (if any were successful)
  final_rf_forecasts_all_regions <- bind_rows(all_regional_rf_cpi_forecasts)

  # Proceed only if there is something to plot from at least one forecasting method.
  if(nrow(final_arima_forecasts_all_regions) > 0 || nrow(final_rf_forecasts_all_regions) > 0) {
    
    # Preparing historical data for all regions that we attempted to forecast.
    historical_data_all_regions_plot <- full_data %>%
      filter(region %in% unique_regions_to_forecast) %>% 
      select(date, region, rent_cpi) %>%
      rename(value = rent_cpi) %>%
      mutate(method = "Actual", type = "Actual") # Adding columns for easy binding and plotting

    # Preparing ARIMA forecast data for plotting, if it exists.
    arima_plot_df <- if(nrow(final_arima_forecasts_all_regions) > 0) {
      final_arima_forecasts_all_regions %>% 
        # Selecting only necessary columns and renaming for consistency before bind_rows.
        # Also including prediction interval columns if I want to plot them.
        select(date, region, 
               value = rent_cpi_forecast_arima,
               lower95 = lower95_arima, # Taking 95% PI
               upper95 = upper95_arima) %>% 
        mutate(method = "ARIMA Direct", type = "Forecast (ARIMA)")
    } else { 
      # Return an empty tibble with the correct structure if no ARIMA forecasts.
      tibble(date=as.Date(character(0)), region=character(0), value=numeric(0), 
             lower95=numeric(0), upper95=numeric(0), method=character(0), type=character(0)) 
    }
    
    # Preparing Random Forest forecast data for plotting, if it exists.
    rf_plot_df <- if(nrow(final_rf_forecasts_all_regions) > 0) {
      final_rf_forecasts_all_regions %>% 
        rename(value = rent_cpi_forecast) %>% 
        mutate(method = "RF Recursive", type = "Forecast (RF Recursive)") %>%
        # Adding empty PI columns for consistent structure with ARIMA df, though RF doesn't provide them directly here.
        mutate(lower95 = NA_real_, upper95 = NA_real_) %>% 
        select(date, region, value, type, method, lower95, upper95) 
    } else { 
      tibble(date=as.Date(character(0)), region=character(0), value=numeric(0), 
             lower95=numeric(0), upper95=numeric(0), method=character(0), type=character(0)) 
    }

    # Combining all data: actuals, ARIMA forecasts, and RF forecasts.
    comparison_plot_data_all <- bind_rows(
      historical_data_all_regions_plot,
      arima_plot_df,
      rf_plot_df
    ) %>% 
    filter(!is.na(value)) # Ensuring no NA values interfere with plotting lines.

    if(nrow(comparison_plot_data_all) > 0){
      # Ensuring 'type' is a factor for consistent plotting order and legend.
      type_levels <- c("Actual") # Actual is always first.
      if(nrow(arima_plot_df) > 0 && !"Forecast (ARIMA)" %in% type_levels) type_levels <- c(type_levels, "Forecast (ARIMA)")
      if(nrow(rf_plot_df) > 0 && !"Forecast (RF Recursive)" %in% type_levels) type_levels <- c(type_levels, "Forecast (RF Recursive)")
      comparison_plot_data_all$type <- factor(comparison_plot_data_all$type, levels = type_levels, ordered = TRUE)

      # Creating the comparison plot.
      all_regions_forecast_plot <- ggplot(comparison_plot_data_all, aes(x = date, y = value, colour = type, linetype = type)) +
        geom_line(linewidth = 0.8) + # Line for actuals and mean forecasts
        # Adding prediction intervals ONLY for ARIMA if they exist
        geom_ribbon(data = filter(comparison_plot_data_all, type == "Forecast (ARIMA)" & !is.na(lower95) & !is.na(upper95)),
                    aes(ymin=lower95, ymax=upper95, x=date), 
                    fill="deepskyblue3", alpha=0.15, inherit.aes=FALSE, show.legend = FALSE) +
        # Faceting by region. Adjust ncol based on how many regions I am forecasting.
        facet_wrap(~region, scales = "free_y", ncol = min(3, length(unique(comparison_plot_data_all$region)))) + 
        scale_colour_manual(values = c("Actual" = "black", "Forecast (ARIMA)" = "deepskyblue3", "Forecast (RF Recursive)" = "orangered2"), 
                            name = "Data/Method:", drop = FALSE) + # drop=FALSE keeps all levels in legend even if one method failed for a region.
        scale_linetype_manual(values = c("Actual" = "solid", "Forecast (ARIMA)" = "dashed", "Forecast (RF Recursive)" = "dotdash"), 
                              name = "Data/Method:", drop = FALSE) +
        scale_y_continuous(labels = comma) +
        labs(title = "Rent CPI Forecast Comparison Across Selected Regions",
             subtitle = "ARIMA Direct vs. Recursive Random Forest (RF uses ARIMA-forecasted Median Price input)",
             x = "Date", y = "Rent CPI") +
        theme(legend.position = "top", legend.key.width = unit(1.5, "cm"),
              strip.text = element_text(size=rel(0.8))) # Adjusting facet label size.
      print(all_regions_forecast_plot)
    } else {
      print("Not enough data to generate combined forecast plot after preparing forecast dataframes.")
    }
  } else {
    print("Could not generate combined forecast plot; no forecast data from any method for any region.")
  }
} else {
    print("Combined plot for multiple regions skipped as conditions not met (e.g., only one region processed by a single method, or forecast data missing).")
}
Plot 7.3: Comparison of Rent CPI Forecasts from Different Methods.

Plot 7.3: Comparison of Rent CPI Forecasts from Different Methods.

This analysis compares the Rent CPI forecasts generated by two distinct methods:

Method 1: Direct ARIMA (dashed blue line with shaded 95% prediction intervals) - Univariate forecast based only on historical Rent CPI.

Method 2: Recursive Random Forest (RF) (dot-dash orange line) - Multivariate forecast using ARIMA-forecasted median prices and other predictors.

Insights

The plots show historical actual Rent CPI (solid black line) followed by the forecasts from both methods for each of the six broad New Zealand regions, extending to the end of 2030.

  1. Auckland
    ARIMA Direct Forecast: Projects a continued strong upward trend for Rent CPI, rising from ~1550 to ~1900-2000 by 2030. The 95% prediction interval (light blue shading) becomes very wide, indicating high uncertainty.
    RF Recursive Forecast: Shows a significant departure, projecting Rent CPI to stabilize or grow very modestly, remaining around the 1500-1550 level throughout the forecast horizon.
    Comparison: The two methods offer starkly different outlooks for Auckland. The ARIMA model extrapolates the strong historical trend, while the RF model, likely influenced by its more subdued median price input forecast, suggests a flattening.
    Insight (Auckland): Significant divergence between methods; ARIMA projects continued growth, RF projects stabilization. The RF forecast implies a structural break from past Rent CPI growth trends, driven by its interpretation of future predictor paths.

  2. Canterbury
    ARIMA Direct Forecast: Projects a gentle upward trend from ~1400-1450 to around 1600-1700 by 2030, with very wide prediction intervals.
    RF Recursive Forecast: Also shows an upward trend, but with more pronounced volatility and a slightly higher trajectory, reaching around 1500-1550 by 2030, but with notable month-to-month fluctuations.
    Comparison: Both models project an increase, but the RF forecast is more volatile. The ARIMA point forecast falls within the general path of the RF forecast initially but then appears slightly more optimistic in the long run.
    Insight (Canterbury): Both methods suggest Rent CPI growth, but the RF model captures more short-term volatility. The long-term ARIMA trend is slightly more bullish than the central path of the RF’s fluctuations.

  3. National
    ARIMA Direct Forecast: Extrapolates the strong, steady historical upward trend, projecting Rent CPI to rise from ~1520 towards 2000 by 2030. Prediction intervals widen significantly.
    RF Recursive Forecast: Projects a clear stabilization of the national Rent CPI, hovering around the 1500-1520 level with minor fluctuations.
    Comparison: Similar to Auckland, there’s a major divergence. ARIMA continues the historical growth, while RF predicts a flattening of the national trend.
    Insight (National): The two methods provide contrasting national outlooks. RF suggests the national Rent CPI growth seen historically will not continue at the same pace, likely due to its more conservative median price input forecast.

  4. Rest of North Island
    ARIMA Direct Forecast: Projects a very strong, almost exponential, continuation of the accelerating upward trend, with Rent CPI forecasted to rise dramatically (note the y-axis scale goes up to 3,000,000, which seems to be an error in the plot’s y-axis scaling for this facet, the actual forecast value would be much lower, likely aiming for >2500 based on previous plots for this method). The line itself shows a very steep increase.
    RF Recursive Forecast: Shows an initial dip from the ~1600 level, then stabilizes around the 1500-1550 mark with some fluctuations.
    Comparison: Extreme divergence. The ARIMA forecast is exceptionally bullish (though the y-axis scaling in the plot is problematic for interpretation of the absolute level, the trajectory is clear). The RF forecast, despite a very strong median price input forecast for this region (as seen in Method 2 analysis), still results in a relatively flat Rent CPI projection.
    Insight (Rest of North Island): The ARIMA forecast appears unrealistic due to its extreme extrapolation (and potential plotting scale issue). The RF forecast suggests stabilization, indicating the model is not translating the aggressive median price input into similarly aggressive Rent CPI growth.

  5. Rest of South Island
    ARIMA Direct Forecast: Projects a largely flat trend around the 1300-1350 level, influenced by the recent sharp drop in historical data. Prediction intervals are extremely wide, indicating massive uncertainty.
    RF Recursive Forecast: Shows initial volatility then stabilizes around the 1450-1500 level, slightly higher than the ARIMA forecast, with ongoing fluctuations.
    Comparison: Both methods project a relatively flat future, but the RF forecast stabilizes at a slightly higher level after some initial volatility. The ARIMA forecast is heavily anchored by the recent downturn.
    Insight (Rest of South Island): Both models suggest an end to the strong historical growth, but the RF model is less pessimistic than the ARIMA model, which is heavily influenced by the latest data points. High uncertainty remains a key feature for ARIMA.

  6. Wellington
    ARIMA Direct Forecast: Shows a continued strong, somewhat cyclical/seasonal upward trend, with Rent CPI projected to rise significantly (note the y-axis scale goes up to 60,000, which is another plotting scale issue; the actual forecast would be aiming for >2500 based on previous plots for this method). The trajectory is steeply upward.
    RF Recursive Forecast: After an initial dip from its early 2020 peak (above 1600), the RF forecast stabilizes around the 1500-1550 level with fluctuations.
    Comparison: Significant divergence. The ARIMA forecast is very bullish (again, y-axis scaling issue). The RF forecast, despite a strong median price input forecast for Wellington, projects stabilization of Rent CPI.
    Insight (Wellington): Similar to “Rest of North Island,” the ARIMA extrapolation seems extreme. The RF model predicts stabilization, not directly mirroring the aggressive median price input, suggesting other factors or model learning are at play.

8. Conclusion and Next Steps

This project has been a comprehensive journey into analysing the New Zealand housing market, focusing on understanding and predicting Rent CPI. Starting with raw data, I went through a detailed process of data cleaning, standardisation, and integration, which was particularly challenging due to the different regional granularities and time frequencies of the source files. The Exploratory Data Analysis (EDA) phase, with its ~15 visualisations, helped me uncover various trends, distributions, and relationships within the housing data. Following that, I engineered several new features which I believed would enhance the predictive power of my models.

8.1. Key Findings and Achievements

  • Data Integration: I successfully merged rental CPI data, granular property price data (by mapping specific locations to broader regions and averaging where necessary), and national quarterly dwelling statistics into a cohesive monthly dataset (full_data) suitable for analysis.
  • EDA Insights: The EDA revealed significant regional variations in both Rent CPI and median property prices, with major centres like Auckland often exhibiting distinct patterns and higher volatility. I also explored correlations and distributions which informed my feature engineering and modelling choices.
  • Regression Modelling: I trained and rigorously evaluated four different machine learning models (Linear Regression, Lasso, Random Forest, and XGBoost) to predict Rent CPI. This involved:
    • A robust preprocessing recipe including imputation, dummy variable creation, normalisation, and handling of correlated predictors.
    • Hyperparameter tuning using 10-fold cross-validation.
  • Model Performance: My results consistently showed that the ensemble models, Random Forest and XGBoost, significantly outperformed the linear models. Based on my last run, Random Forest achieved a test set RMSE of around 26.1, and XGBoost was very close with an RMSE of about 27.4, both vastly better than the linear models (RMSE ~76). This gives me good confidence in their predictive capabilities for this dataset.
  • Initial Forecasting: I implemented two initial forecasting methods for regional Rent CPI (demonstrated for Auckland):
    • A direct ARIMA model applied to the historical Rent CPI series.
    • A recursive Random Forest forecast, which used an ARIMA model (with a log transformation for stability) to project future median_price as a key input. These initial forecasts provided a baseline and highlighted the complexities of long-term prediction, especially regarding assumptions for exogenous variables.

8.2. Limitations of Current Analysis

While I am pleased with the progress, I also recognise some limitations in the current analysis:

  • Exogenous Predictor Forecasts: For the recursive Random Forest model, the future values of median_price were based on a univariate ARIMA forecast, and national dwelling ratios were held constant. More sophisticated forecasting for these key inputs would likely improve the Rent CPI forecast accuracy.
  • price_rent_cpi_ratio Approximation: In the recursive RF forecast, the price_rent_cpi_ratio feature (which originally used contemporaneous rent_cpi) was approximated using the previous month’s rent_cpi. Retraining the RF model with a truly lagged version of this feature would be more robust.
  • Scope of Regional Forecasts in Report: For brevity in this R Markdown document, the detailed forecasting was demonstrated primarily for one region. A full analysis would involve running and comparing these methods across all key regions.
  • Warnings: My R console indicated “50 or more warnings” during the machine learning phase. While the models produced results, a thorough investigation of these warnings would be necessary to ensure complete model stability and address any potential underlying data issues or model fitting quirks.
  • Prediction Intervals for ML Forecasts: My current Random Forest forecast is a point forecast. Generating reliable prediction intervals for such machine learning models requires more advanced techniques (like bootstrapping or quantile regression forests) which I have not yet implemented.

8.3. Future Work and Directions

This project has laid a strong foundation, and there are several exciting avenues I am keen to explore next to build upon this work:

  1. Enhance Forecasting Capabilities:
    • Robust Exogenous Forecasts: Develop more sophisticated time series models (e.g., ETS, Prophet, or even multivariate models like VAR) for key predictors like median_price for each region.
    • Refine Recursive Model Features: Retrain the Random Forest/XGBoost models with features explicitly designed for recursive forecasting (e.g., using lag(rent_cpi) in the price-to-rent ratio).
    • Expand Regional Analysis: Systematically apply and compare both ARIMA and recursive ML forecasting methods across all defined regions.
    • Prediction Intervals: Implement methods to generate prediction intervals for the machine learning forecasts.
  2. Explore Other Time Series Models:
    • Prophet: Utilise Facebook’s Prophet model, which is often very effective for time series with seasonality and trend components.
    • Advanced ARIMA (SARIMAX): Explore SARIMAX models to incorporate exogenous predictors directly into the ARIMA framework for rent_cpi.
  3. Deep Learning for Time Series Forecasting:
    • This was one of my original interests. I plan to prepare the data in the required sequence format (samples, timesteps, features) and experiment with LSTMs (Long Short-Term Memory) or GRUs (Gated Recurrent Units) using keras or torch in R.
  4. Scenario Analysis: Once more robust forecasting models are in place, I would like to conduct scenario analysis by inputting different future paths for key economic drivers (e.g., interest rate changes, different levels of housing supply or migration, if I can source or project these) to see how they impact the housing market forecasts.
  5. Full Warning Investigation: Systematically go through and address all warnings generated during the model training process to ensure maximum robustness.

Overall, this project has been an invaluable learning experience in end-to-end data analysis, from data integration and EDA through to advanced modelling and the initial stages of forecasting. I am excited about the potential to further refine these models and generate even more insightful predictions about the New Zealand housing market.

END OF REPORT


9. Connect With Me

Thanks for taking the time to go through my property market analysis! I really enjoyed digging into this data, building the models, and exploring the forecasting side of things. It has been a fantastic learning journey.

If you would like to see more of my projects or connect professionally, you can find me here:

Cheers!